SOME TEST THEORY FOR TAILORED TESTING


Abstract 

In  a  tailored  test,  each  item  is  selected  for  administration  on  the 
basis  of  the  examinee's  responses  to  previous  items,  with  a  view  towards 
optimum  measurement  of  this  particular  examinee.  Various  simple  rules 
for (1) selecting the items to be administered and (2) scoring the examinee's
responses are compared and evaluated. Some fundamental ideas emerge that
will  serve  as  guides  in  the  future  design  of  tailored  testing  programs. 


ADDENDUM  TO  ETS  RB-68-38 

A practical method of evaluating Robbins-Monro procedures for tailored
tests, without using Monte Carlo methods, can be adapted from a method of
W. G. Cochran and M. Davis described in "The Robbins-Monro method for
estimating the median lethal dose," Journal of the Royal Statistical Society,
Series B (Methodological), 1965, 27, 26-44. Investigations using this method
are under way.
SOME TEST THEORY FOR TAILORED TESTING*


It seems likely that in the not too distant future, many mental tests
will be administered and scored by computer. Computerized instruction
will  be  common,  and  it  will  be  convenient  to  use  computers  to  administer 
achievement  tests  also  (Turnbull,  1968). 

The  computer  can  test  many  examinees  simultaneously,  with  the  same  or 
with  different  tests.  If  desired,  each  examinee  can  be  allowed  to  answer 
test  questions  at  his  own  rate  of  speed.  This  situation  opens  up  new 
possibilities. The computer can do more than simply administer a predetermined
set of test items. Given a pool of precalibrated items to
choose  from,  the  computer  can  design  a  different  test  for  each  individual 
examinee. 

An  examinee  is  measured  most  effectively  when  the  test  items  are 
neither  too  hard  nor  too  easy  for  him.  Thus,  for  any  given  psychological 
trait,  the  computer's  main  task  at  each  step  of  the  test  administration 
might  be  to  estimate  tentatively  the  examinee's  level  on  the  trait,  on  the 
basis  of  his  responses  to  whatever  items  have  already  been  administered.  The 
computer  could  then  choose  the  next  item  to  be  administered  on  the  basis  of 
this  tentative  estimate. 

Such  testing  has  been  called  "branched  testing,"  "programmed  testing," 
"sequential  item  testing,"  and  "computerized  testing."  Clearly,  the 
procedure  could  be  implemented  without  a  computer.  Here,  emphasizing  the 
key  feature,  we  will  speak  of  tailored  testing. 

*This work was supported in part by contract Nonr-2752(00) between the
Office  of  Naval  Research  and  Educational  Testing  Service.  Reproduction, 
translation,  use  and  disposal  in  part  by  or  for  the  United  States  Government 
is  permitted. 
It should be clear that there are important differences between testing for
instructional  purposes  and  testing  for  measurement  purposes.  The  virtue 
of  an  instructional  test  lies  ultimately  in  its  effectiveness  in  changing 
the  examinee.  At  the  end,  we  would  like  him  to  be  able  to  answer  every 
test  item  correctly.  A  measuring  instrument,  on  the  other  hand,  should 
not alter the trait being measured. Moreover, as already noted, measurement
is most effective when the examinee knows the answers to only about
half  of  the  test  items.  The  discussion  here  will  be  concerned 
exclusively  with  measurement  problems  and  not  at  all  with  instructional 
testing. 

Sections  3-6  contain  necessary  technical  preliminaries  for  formulating 
and  dealing  with  the  problem.  Section  8  discusses  key  questions 
in  evaluating  different  testing  procedures.  Sections  10,  12,  17,  and 
19  derive  and  present  mathematical  formulas  necessary  for  describing 
and evaluating various tailored testing procedures. Sections 9 and 12-19
present  some  of  the  numerical  results  obtained  for  various  testing  procedures  in 
various  situations.  Sections  2  and  11  are  devoted  to  general  discussion. 

A  partial  summary  is  given  in  section  20. 

It  is  a  fortunate  fact  that  most  of  the  problems  dealt  with  here  closely 
parallel  similar  problems  in  bioassay.  Much  fruitful  work  has  been  done 
on  the  bioassay  problems.  This  provides  the  inspiration,  the  background, 
and  indeed  the  backbone  of  the  present  report.  A  brief  discussion  of  this 
bioassay  work  is  given  in  section  11  at  a  point  where  the  similarities  and 
differences  with  tailored  testing  problems  can  be  discussed  intelligibly. 

1. A Statement of the Problem

When  the  frequency  distribution  of  the  relevant  psychological  trait 
in  the  group  to  be  tested  is  well  known  from  previous  testings  of  similar 
groups, a Bayesian analysis, using group statistics, is appropriate. Such
an  analysis  would  reach  different  conclusions  depending  on  the  frequency 
distribution  of  the  trait.  The  present  exploratory  treatment  will  not  use 
Bayesian  analysis.  Here  we  will  be  concerned  throughout  with  the  problem 
of  "measuring"  a  single  examinee  with  respect  to  one  psychological  dimension. 
Since  each  examinee  is  to  be  considered  by  himself,  group  statistics  (for 
example,  test  reliability  coefficients)  will  play  only  a  very  marginal  role. 

The notion of "measuring" an examinee implies that there is some
numerical value $\theta$, say, characterizing him, which we wish to determine
or estimate. The data available for making this estimate will be the
examinee's responses to whatever test items are administered to him. The
basic problem is to choose $n$ test items for administration so that his
$n$ responses will enable us to estimate $\theta$ as efficiently as possible.

The optimum set of $n$ items for this purpose depends on the unknown
value of $\theta$. Because of this fact, it is not clear that an optimum strategy
exists, independent of the unknown $\theta$, for choosing the desired $n$ items.

In  any  case,  we  will  not  even  attempt  here  to  find  an  optimum  strategy. 
Instead,  we  will  try  to  evaluate  certain  available  simple  strategies  with 
a view to learning which of these are superior to others and what considerations
seem relevant for determining their various virtues.

2.  Some  Strategies 

Current research in tailored testing (see Linn, Rock, & Cleary, 1969 for
references;  also  Hansen  &  Schwarz,  1968)  is  typically  built  on  the  following 
rule.  If  the  examinee  answers  an  item  correctly,  the  next  item  administered 
should  be  harder;  if  he  answers  it  incorrectly,  the  next  item  should  be 
easier.  This  will  be  referred  to  as  the  branching  rule.  An  obvious  question 
to be answered here is how much the item difficulty should be varied from
item to item. A second question of strategy is how to score the responses
once the items have been administered. Various scoring methods have been
tried, as will be seen.

Suppose  for  the  moment  that  n  ,  the  number  of  items  to  be  administered, 
is  indefinitely  large  and  that  the  branching  rule  is  used.  Suppose  further 
that  at  the  start  large  differences  in  difficulty  are  used  from  item  to  item 
and  that  these  differences  are  gradually  reduced  until  ultimately  successive 
items  are  of  nearly  equal  difficulty.  Will  such  a  strategy  allow  us  to 
pinpoint  the  item  difficulty  level  at  which  the  examinee  answers  exactly 
half  the  items  correctly,  in  the  long  run?  If  so,  we  can  characterize  the 
examinee's  ability  level  in  terms  of  this  difficulty  level. 

The process just described is a Robbins-Monro process (Robbins & Monro,
1951). Conditions for its convergence to the desired value are not difficult
to satisfy in practice. The entire process will be discussed in section 19.
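
As a rough illustration of the shrinking-step idea just described, the sketch below applies the branching rule with a step that decreases as $d/v$ after the $v$-th item. The harmonic step sequence, the function name `robbins_monro_path`, and the default starting difficulty are assumptions made only for illustration; the report defers its own treatment of the procedure to section 19.

```python
def robbins_monro_path(responses, d=1.0, b1=0.0):
    """Return the item difficulties b(1), ..., b(n+1) for a list of 0/1 responses.

    Assumption (not taken from the report): the step used after item v
    is d / v, a common Robbins-Monro choice that shrinks toward zero.
    """
    path = [b1]
    for v, u in enumerate(responses, start=1):
        step = d / v
        # Correct answer (u = 1): next item is harder.
        # Incorrect answer (u = 0): next item is easier.
        path.append(path[-1] + step * (2 * u - 1))
    return path


if __name__ == "__main__":
    print(robbins_monro_path([1, 0, 1, 1, 0]))
```

With any such sequence, the early items move quickly across the difficulty scale and the later items make only small corrections.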

The  point  to  be  made  here  is  that  the  practical  use  of  the  branching 
process  to  estimate  the  examinee's  ability  does  not  require  strongly 
restrictive  assumptions.  It  is  not  necessary  for  this  purpose  to  know 
the  exact  mathematical  form  of  the  dependence  of  item  response  on  examinee 
ability $\theta$ and on parameters, such as item difficulty, describing the item.

If  we  wish  to  evaluate  and  compare  the  efficiency  of  different  methods 
for  estimating  examinee  ability,  however,  it  becomes  necessary  to  have  some 
further  information.  In  any  one  particular  case,  this  information  could  be 
gathered  by  exhaustive  testing  of  the  particular  examinee,  provided  this 
testing  could  be  done  without  changing  him  in  the  process.  For  purposes 
of  the  present  paper,  in  order  to  generalize  our  conclusions  to  as  yet  untested 
populations  of  examinees,  we  will  instead  make  assumptions  about  the 
characteristic  curves  of  the  test  items. 
3. Item Characteristic Curves

An item characteristic curve (icc) represents the probability of a correct
answer to an item as a function of the trait $\theta$ being measured. If the
item is scored zero or one, this curve is automatically also the regression
of item score on $\theta$. Icc's are important, first of all, because they
enable us to quantify important characteristics of individual test items;
secondly, because they enable us to predict, probabilistically, how the
examinee will respond to any chosen item.

Estimated characteristic curves are shown in Figure 1 (reproduced from
Lord, 1968a) for five actual test items. All these curves have the typical
ogive shape with an upper asymptote at $P_i = 1.0$ (where $P_i$ is the
probability of a correct answer) and a lower asymptote at $P_i = c_i$,
$0 \le c_i < 1$, where $c_i$ is a parameter characterizing item $i$.

The solid curves shown are all logistic functions. When $c_i = 0$,
the logistic function is simply

$$P_i \equiv P_i(\theta) \equiv \frac{1}{1 + \exp[-1.7 a_i(\theta - b_i)]}, \qquad (-\infty < \theta < \infty), \tag{1}$$

where $a_i$ and $b_i$ are parameters describing the test item. (The symbol
$\equiv$ is used here and elsewhere to indicate a definition.)

When test items can be answered correctly by random guessing, then
$P_i > 0$ for all $\theta$ and $c_i > 0$. In this case, we sometimes use the
three-parameter logistic function

$$P_i(\theta) \equiv c_i + \frac{1 - c_i}{1 + \exp[-1.7 a_i(\theta - b_i)]}, \qquad (-\infty < \theta < \infty). \tag{2}$$


This is the function shown by the solid curves in Figure 1. Logistic icc's
are discussed in detail by Birnbaum (1966).

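The logistic function in (1) is in some ways a close approximation to
the normal ogive

$$P_i(\theta) \equiv \Phi[a_i(\theta - b_i)] \equiv \int_{-\infty}^{a_i(\theta - b_i)} \phi(u)\, du, \qquad (-\infty < \theta < \infty), \tag{3}$$

where $\Phi$ is defined by (3) and

$$\phi(u) \equiv \frac{1}{\sqrt{2\pi}} \exp\left(-\tfrac{1}{2} u^2\right). \tag{4}$$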

(1) and (3) do not differ by as much as .01 for any value of $\theta$ (the ratio
of (1) to (3) is large in the tails, however). When guessing occurs, (3)
is replaced by

$$P_i(\theta) \equiv c_i + (1 - c_i)\,\Phi[a_i(\theta - b_i)], \qquad (-\infty < \theta < \infty). \tag{5}$$
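
The closeness of the logistic and normal-ogive forms can be checked numerically. The following sketch evaluates both curves on a grid of $\theta$ values and reports the largest absolute difference. The function names, the grid limits, and the use of `math.erf` to obtain $\Phi$ are choices made here for illustration, not part of the report.

```python
import math


def normal_cdf(x):
    """Standard normal distribution function, Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def icc_logistic(theta, a, b, c=0.0):
    """Logistic icc: eq. (1) when c = 0, eq. (2) otherwise."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))


def icc_normal_ogive(theta, a, b, c=0.0):
    """Normal-ogive icc: eq. (3) when c = 0, eq. (5) otherwise."""
    return c + (1.0 - c) * normal_cdf(a * (theta - b))


def max_abs_difference(a=1.0, b=0.0, lo=-6.0, hi=6.0, steps=2400):
    """Largest |logistic - normal ogive| over an evenly spaced theta grid."""
    worst = 0.0
    for k in range(steps + 1):
        theta = lo + (hi - lo) * k / steps
        diff = abs(icc_logistic(theta, a, b) - icc_normal_ogive(theta, a, b))
        worst = max(worst, diff)
    return worst


if __name__ == "__main__":
    print(max_abs_difference())
```

Because both curves depend on $\theta$ only through $a_i(\theta - b_i)$, the largest difference is the same for every choice of $a_i$ and $b_i$, provided the grid covers the same range of $a_i(\theta - b_i)$.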


We assume here that (1), (2), (3), or (5) holds true for a given
examinee regardless of any knowledge that may be available about his
performance on items other than item $i$. This means that when $\theta$ is fixed,
the probability of the examinee answering a fixed set of $n$ items correctly
is simply $\prod_{i=1}^{n} P_i(\theta)$, the product of the separate probabilities.

A common question is whether it may be possible by empirical studies
to establish the superiority of either the logistic or the normal-ogive model
for icc's. The appropriate answer to this question is probably that if an
empirical study could be made sensitive enough to discriminate between
these two models, it would almost surely be sensitive enough to prove that
neither model was strictly correct.

Except where otherwise noted, the present paper assumes that all icc's
are of the form (3) or (5). Similar assumptions have proven to be very
valuable in bioassay work. The reader who feels uncomfortable with such
assumptions should consider the report by Lord (1968), in which (2) was
found to agree closely with other estimates of icc's obtained without
prior assumption regarding their mathematical form. These latter estimates
are shown in Figure 1 by the curved dashed lines.

4. Item Parameters

The parameters $a_i$, $b_i$, and $c_i$ will be used to describe items.

In  this  report,  we  are  primarily  concerned  with  the  problem  of  selecting 
items  to  be  administered.  This  will  be  done  on  the  basis 
of  their  parameters,  determined  in  advance  by  pretesting.  Thus  it  will 
be  worthwhile  here  to  examine  the  meaning  of  these  parameters. 

As already noted, $c_i$ determines the lower asymptote. Accordingly,
$0 \le c_i < 1$. Any examinee, however low his $\theta$, has a chance greater
than $c_i$ of answering the item correctly. Thus, $c_i$ will here be thought of
as the probability of chance success on item $i$ as a result of random guessing.

All of the curves (1), (2), (3), and (5) have an inflexion point, which
is also a center of symmetry for the curve. The parameter $b_i$ gives
the abscissa of the inflexion point, which is at $\theta = b_i$. We will allow
$b_i$ to assume any value in the range $-\infty < b_i < \infty$. It is clear that
$b_i$ represents the difficulty of the item. The larger the value of
$b_i$, the less likely an examinee is to answer the item correctly.
The value $a_i$ is closely related to the slope of the icc at the inflexion
point. For the three-parameter logistic, this slope is $.425(1 - c_i)a_i$;
for the three-parameter normal ogive, it is $.3989(1 - c_i)a_i$. The
parameter $a_i$ is spoken of as representing the discriminating power
of the item. For $a_i > 0$, the larger the value of $a_i$, the more the item
discriminates between examinees with high $\theta$ and examinees with low $\theta$.
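
The two slope constants quoted above can be recovered by differentiating (2) and (5) with respect to $\theta$ and evaluating at the inflexion point $\theta = b_i$. The short derivation below is added for the reader's convenience and uses only the definitions already given.

```latex
% Three-parameter logistic, eq. (2): with z = 1.7 a_i (\theta - b_i) and
% \Psi(z) = 1 / (1 + e^{-z}), we have \Psi'(z) = \Psi(z)[1 - \Psi(z)], so at z = 0
\left.\frac{dP_i}{d\theta}\right|_{\theta = b_i}
  = (1 - c_i)\,(1.7\,a_i)\,\tfrac{1}{2}\cdot\tfrac{1}{2}
  = .425\,(1 - c_i)\,a_i .

% Three-parameter normal ogive, eq. (5): the derivative of \Phi is \phi, eq. (4), so
\left.\frac{dP_i}{d\theta}\right|_{\theta = b_i}
  = (1 - c_i)\,a_i\,\phi(0)
  = \frac{(1 - c_i)\,a_i}{\sqrt{2\pi}}
  \approx .3989\,(1 - c_i)\,a_i .
```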

We plan to use items that are positively correlated with $\theta$; consequently
we shall restrict $a_i$ to the range $0 < a_i < \infty$.

In this report, we assume that all items have been calibrated (the
item parameters estimated) by pretesting in advance of any use we make
of them. How this is done is not our problem here, but we will glance
at it briefly. Presumably, over a period of years or decades, a large
pool of items with accurately estimated parameters will gradually be
accumulated.

The item parameters apparently can be estimated by maximum likelihood
(Lord, 1968b; Bock, 196?; Birnbaum, 1968, section 17.9). This is at
present a costly and hazardous operation. The parameters can be approximated
by more familiar procedures, which we mention below. These approximations
are based on the assumption of normal ogive icc's with $c_i = 0$ and on the
rather unlikely assumption that $\theta$ is normally distributed in the group tested.
Since the unit of measurement used to express $\theta$ is arbitrary, we will choose
it so that $\sigma_\theta = 1$.

Under the assumptions stated, the parameter $a_i$ can be estimated
with the help of the relation (Lord & Novick, 1968, section 16.10):

$$a_i = \frac{\rho'_i}{\sqrt{1 - \rho'^2_i}}, \tag{6}$$

where $\rho'_i$ is the biserial correlation of item response with $\theta$. In
order to estimate $\rho'_i$ from observable quantities, we can use the fact
that it is also the loading of item $i$ on the common factor of the
tetrachoric item intercorrelation coefficients $\rho'_{ij}$. We note that $a_i$
is a monotonic increasing function of $\rho'_i$. Also,

$$\rho'_i = \frac{a_i}{\sqrt{1 + a_i^2}}, \tag{7}$$

$$\rho'_{ij} = \frac{a_i}{\sqrt{1 + a_i^2}} \cdot \frac{a_j}{\sqrt{1 + a_j^2}}. \tag{8}$$


Under the same assumptions, $b_i$ can be estimated by using its relation
to $\pi_i$, the proportion of correct answers to item $i$ (Lord & Novick,
1968, section 16.9):

$$b_i = -\frac{1}{a_i} \sqrt{1 + a_i^2}\; \Phi^{-1}(\pi_i), \tag{9}$$

where $\Phi^{-1}$ is the inverse of the normal ogive function defined in (3).
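
The approximations just described can be sketched in a few lines. The code below assumes that (6) has the form $a_i = \rho'_i / \sqrt{1 - \rho'^2_i}$, with $\rho'_i$ the biserial correlation, and that (9) has the form $b_i = -\sqrt{1 + a_i^2}\,\Phi^{-1}(\pi_i)/a_i$. The function name `approx_item_parameters` and the illustrative input values are hypothetical. The sketch is valid only under the stated assumptions: $c_i = 0$, normal-ogive icc's, and $\theta$ normally distributed with unit variance.

```python
import math
from statistics import NormalDist


def approx_item_parameters(biserial, prop_correct):
    """Approximate (a_i, b_i) from the biserial correlation of the item
    with theta and the proportion of correct answers.

    Assumes c_i = 0, a normal-ogive icc, and theta ~ N(0, 1).
    """
    # Assumed form of eq. (6): discriminating power from the biserial.
    a = biserial / math.sqrt(1.0 - biserial ** 2)
    # Assumed form of eq. (9): difficulty from the proportion correct.
    b = -math.sqrt(1.0 + a ** 2) / a * NormalDist().inv_cdf(prop_correct)
    return a, b


if __name__ == "__main__":
    # Hypothetical item: biserial .45, answered correctly by 33% of the group.
    print(approx_item_parameters(0.45, 0.33))
```

An item answered correctly by fewer than half of the group receives a positive $b_i$, in keeping with the interpretation of $b_i$ as difficulty.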

To  keep  matters  simple  in  this  preliminary  survey,  we  will  assume 
throughout  the  remainder  of  this  report  that  all  items  available  for  a 
particular test or testing have the same value of $a_i$, and that all have
the same value of $c_i$. Thus items will be chosen for administration
solely  according  to  their  difficulty,  as  represented  by  the  parameters  b^  . 
It  will  be  found  that  this  simplification  does  not  by  any  means  make  our 
problem  a  trivial  one. 

Before proceeding, let us write down a formula that will help us
to interpret $a_i$ in terms of more familiar test-theory statistics. Suppose
all items are equivalent, i.e., all items have the same icc; then all interitem
tetrachoric correlations $\rho'_{ij}$ are the same. If also all $b_i = 0$ and
all $c_i = 0$, then, assuming a normal distribution of $\theta$ and a normal ogive
icc, the interitem phi coefficient (product moment correlation between
dichotomously scored items) can be expressed as follows (Lord & Novick, 1968,
eq. 15.9.3):

$$\rho_{ij} = \frac{2}{\pi} \arcsin \rho'_{ij}, \tag{10}$$

the angle being expressed in radians. By (8), since the items are equivalent,

$$\rho_{ij} = \frac{2}{\pi} \arcsin \frac{a_i^2}{1 + a_i^2}. \tag{11}$$

By the Spearman-Brown formula, the reliability of the number-right score on
a test composed of $n$ equivalent items is

$$\frac{n\rho_{ij}}{1 + (n - 1)\rho_{ij}}. \tag{12}$$


If $a_i = .333$, under the assumptions already made this reliability for a
60-item test will be .80; if $a_i = .5$, this reliability will be .90; if
$a_i = 1.0$, this reliability will be .97. In view of this, we will choose
$a_i = .5$ as a typical value and will address most of our attention to it.
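
The reliabilities quoted above can be reproduced directly from the chain of formulas. The sketch below assumes equivalent items with $b_i = 0$ and $c_i = 0$, takes the common tetrachoric correlation to be $a_i^2/(1 + a_i^2)$, converts it to a phi coefficient by $(2/\pi)\arcsin$, and applies the Spearman-Brown formula. The function name `number_right_reliability` is introduced here only for illustration.

```python
import math


def number_right_reliability(a, n):
    """Reliability of the number-right score on n equivalent items.

    Assumes b_i = 0, c_i = 0, normal-ogive icc's, and normally
    distributed theta, as in the text.
    """
    tetrachoric = a * a / (1.0 + a * a)              # eq. (8) with a_i = a_j
    phi = (2.0 / math.pi) * math.asin(tetrachoric)   # eqs. (10) and (11)
    return n * phi / (1.0 + (n - 1) * phi)           # eq. (12), Spearman-Brown


if __name__ == "__main__":
    for a in (0.333, 0.5, 1.0):
        print(a, round(number_right_reliability(a, 60), 2))
```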

The following table will help the reader to reinterpret the meaning
of $b_i$. If $\theta$ is normally distributed with a mean of 0 and a standard
deviation of 1, then, under the normal ogive model, the proportion of
correct answers given by the group of examinees to an item with $a_i = .5$ is
as shown:
   b_i         -3.0   -2.0   -1.0     0     1.0    2.0    3.0

   c_i = 0      .91    .81    .67    .50    .33    .19    .09

   c_i = .2     .93    .85    .74    .60    .46    .35    .27
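
The entries of this table can be regenerated from the normal ogive model. The sketch below assumes that, for $\theta$ distributed normally with mean 0 and standard deviation 1, the proportion of correct answers to an item is $c_i + (1 - c_i)\,\Phi(-a_i b_i/\sqrt{1 + a_i^2})$; this is the relation between $b_i$ and the proportion correct used for calibration, extended to allow for guessing. The function name `proportion_correct` is hypothetical.

```python
import math
from statistics import NormalDist


def proportion_correct(b, a=0.5, c=0.0):
    """Expected proportion of correct answers in a N(0, 1) group of examinees
    for a normal-ogive item with parameters a, b, c."""
    rho = a / math.sqrt(1.0 + a * a)
    return c + (1.0 - c) * NormalDist().cdf(-rho * b)


def table_row(c, difficulties=(-3, -2, -1, 0, 1, 2, 3)):
    """One row of the table: proportions rounded to two decimals."""
    return [round(proportion_correct(b, c=c), 2) for b in difficulties]


if __name__ == "__main__":
    for c in (0.0, 0.2):
        print("c =", c, table_row(c))
```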

5. Stochastic Processes and Random Walks

As foreshadowed earlier, all the testings to be considered in the present
report proceed one item at a time, as follows. After administration of the
first item, each subsequent item is picked for administration by some
predetermined rule, solely on the basis of the examinee's response to the
preceding item. The choice among items is made entirely in terms of item
difficulty, $b$. Let the superscript $v = 1, 2, 3, \ldots$ refer to the order in
which the items are administered, so that item $v + 1$ is the item administered
immediately after item $v$.

Now, the origin and unit of measurement in which $b$ and $\theta$ are
expressed are purely arbitrary; it is easily seen, for example, that
adding a constant to $b$ in (1), (2), (3), or (5) while adding the
same constant to $\theta$ will have no effect on the item characteristic
curve. Since we are free to choose an origin, we shall place it at $b^{(1)}$
so that hereafter

$$b^{(1)} = 0 \tag{13}$$

unless specifically stated otherwise.

In general, after a successful response we will want $b^{(v+1)} > b^{(v)}$;
after an unsuccessful response, $b^{(v+1)} < b^{(v)}$. (Conceivably, blocks of
items  may  be  substituted  for  single  items  in  this  scheme,  with  some 
elaboration  of  the  rule  for  choosing  each  successive  block.) 
Clearly, after the first item, the value of $b^{(v+1)}$, $v = 1, 2, 3, \ldots$,
is a chance variable. The frequency distribution of $b^{(v+1)}$ depends,
in accordance with (1), (2), (3), or (5), on the value of $b^{(v)}$ and on
the examinee's value of $\theta$ (considered as fixed). Such a sequence of
random variables is called a stochastic process. Furthermore, this
process has the Markov property: Once $b^{(v)}$ is known for a given $\theta$, the
probability of any value of $b^{(v+1)}$ is independent of the values of
$b^{(1)}, b^{(2)}, \ldots, b^{(v-1)}$. Thus for a given examinee the random variable
$b^{(v)}$ constitutes a Markov process.

By what rule should the successive values of $b^{(v)}$ be chosen? A
plausible branching rule would be the following. After administering item $v$,
compute the maximum likelihood estimate of $\theta$. Choose $b^{(v+1)}$ equal to
this estimate. Do this for $v = 1, 2, 3, \ldots$.

For a fixed set of items, it is not difficult (at least not for a
computer) to obtain a maximum likelihood estimate of $\theta$ from the likelihood
function

$$L(u_1, u_2, \ldots, u_n) \equiv \prod_{v=1}^{n} P_v(\theta)^{u_v}\, Q_v(\theta)^{1 - u_v}, \tag{14}$$

where $P_v(\theta)$ is given by (1), (2), (3), or (5) and $Q_v(\theta) \equiv 1 - P_v(\theta)$. Now
for  a  fixed  set  of  items,  the  values  of  the  item  difficulties  are  fixed  and 
known;  but  for  any  stochastic  process,  they  are  random  variables.  This 
complicates  the  problem  to  such  a  point  that  we  shall  not  attempt  to 
evaluate  the  results  obtained  when  the  stochastic  process  itself  depends  on 
successive  maximum  likelihood  estimation. 
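
For the simpler case of a fixed set of items, the maximum likelihood computation mentioned above can be sketched as follows. The code evaluates the logarithm of the likelihood of the 0/1 responses under the normal-ogive icc and picks the best value of $\theta$ on a grid. The grid search, its limits, and the function names are assumptions made for illustration; a practical program would more likely use an iterative method, and no finite estimate exists when all responses are correct or all are incorrect (with $c_i = 0$).

```python
import math


def normal_cdf(x):
    """Standard normal distribution function, Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def log_likelihood(theta, responses, difficulties, a=0.5, c=0.0):
    """Log of the likelihood of 0/1 responses for a fixed set of items
    with known difficulties, common a and c, and normal-ogive icc's."""
    total = 0.0
    for u, b in zip(responses, difficulties):
        p = c + (1.0 - c) * normal_cdf(a * (theta - b))
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total


def grid_mle(responses, difficulties, a=0.5, c=0.0, lo=-4.0, hi=4.0, step=0.01):
    """Value of theta on an evenly spaced grid that maximizes the likelihood."""
    n_points = int(round((hi - lo) / step)) + 1
    grid = [lo + k * step for k in range(n_points)]
    return max(grid, key=lambda th: log_likelihood(th, responses, difficulties, a, c))


if __name__ == "__main__":
    print(grid_mle([1, 1, 0, 1, 0], [-1.0, -0.5, 0.0, 0.5, 1.0]))
```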

At this point, let us consider just one very simple kind of Markov
process. This process is well known in bioassay work as the up-and-down
method. In our terms, if the examinee answers item $v$ correctly, then
we choose $b^{(v+1)} = b^{(v)} + d$, where $d$ is some step size that we pick
in advance; if he answers item $v$ incorrectly, then $b^{(v+1)} = b^{(v)} - d$.
It is apparent that in the up-and-down method, the random variable $b^{(v)}$
can take on the values

$$b^{(v)} = Jd, \tag{15}$$

where $J$ is a (possibly negative) integer. Actually, $J$ equals the number
of correct responses minus the number of incorrect responses for the first
$v - 1$ items. We see that


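$$b^{(v+1)} = \begin{cases} b^{(v)} + d & \text{with probability } P(\theta - b^{(v)}), \\ b^{(v)} - d & \text{with probability } Q(\theta - b^{(v)}), \\ \text{any other value} & \text{with probability } 0, \end{cases} \tag{16}$$

where the notation $P(\theta - b^{(v)}) \equiv P_v(\theta)$ is used in order to display the
role of the item parameter $b^{(v)}$.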

A Markov process in which the random variable can take only a
denumerable set of values (as is the case here, see (15)) is called
a Markov chain. A Markov chain satisfying (16), for specified values of
$P$ and $Q$, is a random walk. The $P$'s and $Q$'s are called transition
probabilities. The fact that for fixed $\theta - b^{(v)}$, $P(\theta - b^{(v)})$ does
not depend on $v$ is customarily described by saying that the transition
probabilities are stationary.

Now,

$$b^{(v)} = b^{(1)} + d \sum_{r=1}^{v-1} (2u_r - 1).$$


Let us note in passing that the likelihood function of the item responses $u_v$
under the up-and-down method is

$$\prod_{v=1}^{n} P^{u_v}\Big[\theta - b^{(1)} - d \sum_{r=1}^{v-1} (2u_r - 1)\Big]\; Q^{1 - u_v}\Big[\theta - b^{(1)} - d \sum_{r=1}^{v-1} (2u_r - 1)\Big].$$


6.  Scoring  Methods 

To set up a tailored testing operation, we must, in effect, choose
not only a stochastic process but also a scoring procedure. For the most
part, we shall consider just three simple possibilities, all of which have
been used in experimental work on tailored testing. Any one of the
different scores will be denoted by $x$. The total number of items to
be administered to a given examinee is denoted by $n$; this is assumed
fixed in advance of the testing. Let $u_i = 1$ denote a "correct" response
to item $i$, and let $u_i = 0$ denote an incorrect response.

1. This score is the number of items answered correctly:

$$x \equiv \sum_{v=1}^{n} u_v. \tag{17}$$

This is the conventional number-right score.

2. This score is the difficulty of the item that would have been
administered to the examinee after the $n$-th item:

$$x \equiv b^{(n+1)} = \begin{cases} b^{(n)} + d & \text{if } u_n = 1, \\ b^{(n)} - d & \text{if } u_n = 0. \end{cases} \tag{18}$$

It will be referred to as the final difficulty score.




3. This score is the average of the item difficulties, excluding
$b^{(1)} = 0$ but including $b^{(n+1)}$ (as defined by eq. 18):

$$x \equiv \frac{1}{n} \sum_{v=2}^{n+1} b^{(v)}. \tag{19}$$

It will be called the average difficulty score.
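
The three scores can be computed together from one record of responses under the up-and-down method. In the sketch below, the function name `tailored_scores` is hypothetical, the responses are coded 1 for correct and 0 for incorrect, and the average difficulty score is taken to be the mean of $b^{(2)}, \ldots, b^{(n+1)}$.

```python
def tailored_scores(responses, d, b1=0.0):
    """Return (number-right, final difficulty, average difficulty) scores
    for 0/1 responses gathered by the up-and-down method with step size d."""
    n = len(responses)
    # b[0] is b(1); b[v] is b(v+1), the difficulty chosen after item v.
    b = [b1]
    for u in responses:
        b.append(b[-1] + d * (2 * u - 1))
    number_right = sum(responses)           # eq. (17)
    final_difficulty = b[n]                 # b(n+1), eq. (18)
    average_difficulty = sum(b[1:]) / n     # mean of b(2), ..., b(n+1), eq. (19)
    return number_right, final_difficulty, average_difficulty


if __name__ == "__main__":
    print(tailored_scores([1, 1, 0, 1, 0, 0], d=0.5))
```

Under the up-and-down method the final difficulty score equals $d$ times the number of correct responses minus the number of incorrect responses, so it carries the same information as the number-right score; the average difficulty score also depends on the order of the responses.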

To start with, let us consider the situation where we use the simplest
of these scores, the final difficulty score, in conjunction with the
up-and-down method of selecting items.

7. Up-and-Down Method with Final Difficulty Score

Since $b^{(v)}$ forms a Markov chain under the up-and-down method, standard
formulas are available for finding the sampling distribution of the score
$x = b^{(n+1)}$ for any specified $n$. These are outlined in the Appendix.
For the present purpose of evaluating this particular tailored testing
procedure, we will not need the entire frequency distribution of $x = b^{(n+1)}$,
but we will need its expectation and sampling variance.

We need to improve our notation at this point. Let us write $X_v(b,\theta)$
to denote the score after administering $v$ items when the difficulty
of the first item administered is $b$ and when the ability level of the
examinee is $\theta$. The score $X_v(b,\theta)$ is a random variable whose distribution
depends on $v$, $b$, and $\theta$. The distribution also depends on $d$ and
on the item parameters $a$ and $c$, although these are not explicit in our
notation. In actual practice, the first item administered has $b = 0$,
as already explained in section 5. For the derivation, however, we will
need to consider different possible values of $b$.


If the first item administered is answered correctly, the second
item in the up-and-down method is picked to have a difficulty of $b + d$.
By virtue of the Markov property, once $X_1(b,\theta)$ is fixed, subsequent
performance does not depend on the examinee's response to item 1. Thus
when the first item is answered correctly, $X_{v+1}(b,\theta) = X_v(b + d,\theta)$ for
$v = 1, 2, 3, \ldots$.

A similar analysis can be made for the case where the first item is
answered incorrectly. Thus we can write for $v = 1, 2, 3, \ldots$

$$X_{v+1}(b,\theta) = \begin{cases} X_v(b + d,\theta) & \text{with probability } P(\theta - b), \\ X_v(b - d,\theta) & \text{with probability } Q(\theta - b). \end{cases} \tag{20}$$

This equation and similar ones derived by the same line of reasoning are
fundamental to most of our practical results. It provides a relationship
connecting the random variable $X_{v+1}$ with the random variable $X_v$,
allowing us to compute necessary quantities for $X_n$ recursively.

Let

$$G_v(b,\theta) \equiv X_v(b,\theta) - \theta \tag{21}$$

denote the error in the score $X_v$, and write $t \equiv \theta - b$. Then, from
(20), for $v = 1, 2, 3, \ldots$

$$G_{v+1}(b,\theta) = \begin{cases} G_v(b + d,\theta) & \text{with probability } P_t, \\ G_v(b - d,\theta) & \text{with probability } Q_t, \end{cases} \tag{22}$$

where we write $P_t$ instead of $P(\theta - b)$. The values of $P$ and $Q$ are,
as always, to be computed from (1), (2), (3), or (5). In particular, we
see from (20) that

$$G_1(b,\theta) = \begin{cases} d - t & \text{with probability } P_t, \\ -d - t & \text{with probability } Q_t. \end{cases} \tag{23}$$

The bias of $X_v$ is the expectation of $G_v$ for given $v$ and $\theta$,
which we denote by

$$E_v(b) \equiv \mathcal{E}\, G_v(b,\theta). \tag{24}$$

Although $E_v(b)$ is a function of $\theta$, we will omit the symbol $\theta$ to keep
the formulas simple. From (22), for $v = 1, 2, 3, \ldots$,

$$E_{v+1}(b) = P_t E_v(b + d) + Q_t E_v(b - d). \tag{25}$$

From (23),

$$E_1(b) = (d - t)P_t - (d + t)Q_t = d(P_t - Q_t) - t(P_t + Q_t) = d(1 - 2Q_t) - t. \tag{26}$$

The bias of $X_n$ for given $n$ and $\theta$ can be computed recursively using
(26) and (25).

To get the sampling variance of $X_n \equiv X_n(b,\theta)$, we start by defining
$W_v$ as the expected mean square error of $X_v$ for a given $\theta$:

$$W_v \equiv \mathcal{E}\, G_v^2 = \mathcal{E}(X_v - \theta)^2, \tag{27}$$

where for convenience we omit the arguments in parentheses. Thus, by (22),

$$W_{v+1}(b) = P_t W_v(b + d) + Q_t W_v(b - d). \tag{28}$$

Also, by (23),

$$W_1(b) = (d - t)^2 P_t + (d + t)^2 Q_t. \tag{29}$$

The sampling variance of $X_n$ is given by

$$\sigma^2_{X_n \mid \theta} \equiv \mathcal{E}\, G_n^2 - (\mathcal{E}\, G_n)^2 = W_n(b) - [E_n(b)]^2. \tag{30}$$
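
The recursions for the bias and the mean square error lend themselves to direct computation. The sketch below assumes the normal-ogive icc of (5) with a common $a$ and $c$ for all items. It takes the starting values to be $E_1(b) = (d - t)P_t - (d + t)Q_t$ and $W_1(b) = (d - t)^2 P_t + (d + t)^2 Q_t$, with $t = \theta - b$, and applies the same weighted-average recursion to both quantities. The function names and the use of memoization over the integer number of net upward steps are implementation choices made here, not part of the report.

```python
import math
from functools import lru_cache


def normal_cdf(x):
    """Standard normal distribution function, Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bias_and_variance(n, theta, d, a=0.5, c=0.0, b=0.0):
    """Bias E_n(b) and sampling variance of the final difficulty score X_n
    under the up-and-down method, for an examinee of ability theta."""

    @lru_cache(maxsize=None)
    def moments(v, j):
        # (E_v, W_v) when the first item has difficulty b + j*d.
        t = theta - (b + j * d)
        p = c + (1.0 - c) * normal_cdf(a * t)
        q = 1.0 - p
        if v == 1:
            e1 = (d - t) * p - (d + t) * q               # eq. (26)
            w1 = (d - t) ** 2 * p + (d + t) ** 2 * q     # eq. (29)
            return e1, w1
        e_up, w_up = moments(v - 1, j + 1)
        e_dn, w_dn = moments(v - 1, j - 1)
        # eqs. (25) and (28): weight the two branches by P_t and Q_t.
        return p * e_up + q * e_dn, p * w_up + q * w_dn

    e_n, w_n = moments(n, 0)
    return e_n, w_n - e_n ** 2                            # eq. (30)


if __name__ == "__main__":
    for theta in (0.0, 1.0, 2.0):
        print(theta, bias_and_variance(n=20, theta=theta, d=0.5))
```

Because the difficulty can only move in steps of $d$, the recursion visits at most about $n^2$ distinct states, so the exact bias and variance are obtained without any Monte Carlo sampling.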


For future use, let us write down here one more recursion relation,
enabling us to find

$$D_v(b) \equiv \frac{\partial}{\partial\theta}\, \mathcal{E} X_v(b,\theta). \tag{31}$$

From (20),

$$\mathcal{E} X_{v+1}(b,\theta) = P_t\, \mathcal{E} X_v(b + d,\theta) + Q_t\, \mathcal{E} X_v(b - d,\theta),$$

so that

$$D_{v+1}(b) = P_t D_v(b + d) + Q_t D_v(b - d) + [E_v(b - d) - E_v(b + d)]\, \frac{dQ_t}{d\theta}. \tag{32}$$

Also, from (20) and (13),

$$\mathcal{E} X_1(b,\theta) = P_t(b + d) + Q_t(b - d),$$

so that

$$D_1(b) = -2d\, \frac{dQ_t}{d\theta}. \tag{33}$$

If the icc is a normal ogive as in (3), then

$$\frac{dQ_t}{d\theta} = -a_i\, \phi[a_i(\theta - b_i)].$$

8. Evaluation of Estimates of $\theta$

Before giving numerical results evaluating the up-and-down method
with final difficulty score, it is necessary to consider at some length
just how such results should be evaluated. This is not a trivial matter.

Empirical research studies (see Linn, Rock, & Cleary, 1969, and other
references given there; also Hansen & Schwarz, 1968) have often used the
correlation of tailored-test score $x$ with some outside criterion to evaluate
the effectiveness of testing and scoring procedures. If a particular
examiner has repeatedly tested similar groups, he may know approximately
in advance the distribution of $\theta$ in the next group he plans to test;
such an examiner may well use group statistics and Bayesian methods. Although
these kinds of evaluation have obvious face validity, they will not be
used here for at least two reasons:
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1.  The  correlation  coefficient  is  a  group  statistic,  whereas  the 
problem  here  is  to  determine  the  accuracy  with  which  we  can  measure 

a  single  individual.  Important  information  about  the  accuracy  obtain¬ 
able  for  specific  individuals  is  lost  when  this  information  is  pooled 
over  individuals  to  get  a  group  statistic. 

2.  In  order  to  obtain  a  group  statistic,  we  would  have  to  make  some 
assumption  about  the  frequency  distribution  of  6  in  the  group 
studied.  Such  an  assumption  would  prevent  easy  generalization  of  our 
results  to  groups  with  substantially  different  distributions  of  6  .  (The 
few  results  available  to  date  (see  Lord,  1968b)  run  against  the  convenient 
assumption  that  6  is  likely  to  be  approximately  symmetrically  distributed 
at  least  in  the  case  of  highly  selected  groups -such  as  college  students.) 
The  gain  to  be  hoped  for  from  tailored  tests  arises  entirely  (or 

nearly  so)  from  tailoring  the  item  difficulties  to  the  ability  of  the 
examinee .  But  in  a  typical  test  that  is  not  too  heterogeneous  in  item 
difficulty,  most  of  the  items  are  already  well  tailored  to  the  abilities  of 
most  of  the  examinees.  Thus  tailored  testing  can  not  provide  greatly 
improved  measurement  for  most  examinees.  The  value  of  tailored  tests 
is  primarily  for  those  examinees  for  whom  the  conventional  test  would 
too  easy  or  too  difficult.  The  correlation  coefficient  over  the  entire 
group  of  examinees  is  not  a  good  index  for  judging  the  improved  measurement 
gained  by  a  minority. 

One way to describe the measurement properties of the score $x$ is by
giving its standard error (the square root of the sampling variance
in eq. 30). It is no surprise to find that the standard error depends
on the unknown value of $\theta$. The measurement properties of the score $x$
cannot be summarized by a single number, but must be represented by a curve,
a function of $\theta$. Thus we might find, for example, that score $x_1$ provides
more accurate measurement than score $x_2$ for a certain range of values of
$\theta$, while score $x_2$ is more accurate for examinees outside this range.

In bioassay work (e.g., Brownlee, Hodges, & Rosenblatt, 1953), $E_n(b)$,
the bias of $x$, must be taken into account. This is done by using the
expected mean-square error $W_n(b) \equiv \mathcal{E}(X_n - \theta)^2$ to describe the accuracy
of measurement. As shown in (30), the expected mean-square error exceeds
the sampling variance by the square of the bias.

In mental testing, on the other hand, the scale in which $\theta$ is
measured has an arbitrary origin and unit of measurement. Thus a constant
bias in the score $x$, or even a bias that changes linearly with $\theta$, would
not impair the value of $x$ at all. To carry matters further, the scale for $\theta$
that yields (2) or (5) is highly arbitrary. Any monotonic transformation of this
scale would be defensible for measuring the examinee. In comparing different
tailored testing procedures, we must use an index of effectiveness that always
leads to the same conclusion no matter what monotonic transformation of the $\theta$
scale is chosen.

In order to describe the effectiveness of the score $x$ for measurement
purposes, we will use

$$I_x(\theta) \equiv \frac{\left[\dfrac{\partial}{\partial\theta}\,\mathcal{E}(x|\theta)\right]^2}{\sigma^2_{x|\theta}} . \tag{34}$$

(The reader may wish at this point to glance at Figure 2, which shows
$\sqrt{I_x(\theta)}$ for certain testing procedures.) This is the quantity recommended
by Birnbaum (1968, eq. 17.7.10) to measure the "information" in the score
$x$. The use of $I_x(\theta)$ also was recommended by Lord (195?, eq. 57) and, in
a very different context, by Mandel and Stiehler (1954). We will call
$I_x(\theta)$ the information function for the score $x$.

As Birnbaum shows, in large samples, $I_x(\theta)$ is inversely proportional
to the square of the length of the confidence interval for estimating $\theta$
from $x$. Birnbaum uses this information function for small as well as
large samples, and we shall do so here. The meaning and justification of
this index in small samples are well described by Mandel, whom we paraphrase
closely here:

If it is desired to differentiate between two nearby values, $\theta'$ and
$\theta''$, by means of the corresponding measurements $x'$ and $x''$, it is
apparent that the success of the operation will depend on two circumstances:
(1) the magnitude of the difference $\mathcal{E}'' - \mathcal{E}' \equiv \mathcal{E}(x''|\theta'') - \mathcal{E}(x'|\theta')$
for a given difference $\theta'' - \theta'$, i.e., the magnitude of the slope
$(\mathcal{E}'' - \mathcal{E}')/(\theta'' - \theta')$; and (2) the precision of measurement.

These two desiderata can be combined in a single criterion, the
information function, defined as the square of the ratio of the slope
to $\sigma_{x|\theta}$.

It  is  helpful  to  visualize  the  situation  with  the  aid  of  a  small  diagram: 


A more formal discussion of the small-sample interpretation is given by Lord (1952).

It is important to note that $I_x(\ )$ is an operator, not a function.
This means that $I_x(a\theta)$ must be found from the definition in (34), not
by writing down $I_x(\theta)$ and then substituting $a\theta$ for $\theta$.

It is apparent from (34) that any change in the unit used to measure
$\theta$ does change $I_x(\theta)$. Thus $I_x(\theta)$ is not a pure number. In fact,
$1/\sqrt{I_x(\theta)}$ is expressed in the same score units as $\theta$.

Suppose that a monotonic increasing transformation is made on $\theta$ so
that $\theta^* \equiv \theta^*(\theta)$ replaces $\theta$. Then the denominator of (34) remains
unchanged, but the numerator must be multiplied by $(d\theta/d\theta^*)^2$ in order to
find $I_x(\theta^*)$. Since $d\theta/d\theta^*$ may have any form, it is possible to make
$I_x(\theta^*)$ assume any shape desired, within the restriction $I_x(\ ) > 0$,
simply by a suitable choice of the transformation $\theta^*(\theta)$.

The conclusion to be drawn is that unless we are willing to assert
that we have used a uniquely appropriate scale for measuring $\theta$, we
cannot draw conclusions from the shape of the information function. This
drastic limitation leads to no difficulties in the present study since
we shall always be comparing two or more information functions, all based
on the same scale for $\theta$. The ratio between the information function
for $x_1$ and the information function for $x_2$ measures the relative
efficiency of $x_1$ and $x_2$ for estimating $\theta$. The relative efficiency
is unaffected by any differentiable monotonic increasing transformation of
$\theta$, since the factor $(d\theta/d\theta^*)^2$ appears twice and cancels out.
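
The cancellation can be written out in one line. The display below is only a
worked restatement of the argument, under the assumption that $x_1$ and $x_2$ are
two scores evaluated on the same $\theta$ scale and that $\theta^*(\theta)$ is
differentiable and strictly increasing; the index $j$ is introduced here for
convenience.

```latex
% Assumption: theta^* = theta^*(theta) is differentiable and strictly increasing,
% and sigma^2_{x_j | theta} does not depend on how theta is labeled.
\begin{align*}
I_{x_j}(\theta^*)
  &= \frac{\bigl[\partial\mathcal{E}(x_j \mid \theta)/\partial\theta^*\bigr]^2}
          {\sigma^2_{x_j \mid \theta}}
   = I_{x_j}(\theta)\left(\frac{d\theta}{d\theta^*}\right)^{2},
   \qquad j = 1, 2, \\[4pt]
\frac{I_{x_1}(\theta^*)}{I_{x_2}(\theta^*)}
  &= \frac{I_{x_1}(\theta)}{I_{x_2}(\theta)} .
\end{align*}
```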

In studying the effectiveness of tailored testing procedures, it will
be helpful if we can compare them to more familiar procedures. To keep
matters simple, we will compare the information function of the tailored
testing procedure with the information function of the number-right score
on a conventional test composed of $n$ equivalent items with normal ogive icc
and with all $b_i = 0$. This test will hereafter be referred to as the
standard test (use of number-right score is to be understood). The
information function for (number-right score on) this standard test is
(Birnbaum, 1968, eqs. 20.2.2, 20.5.1)

$$I(\theta) = \frac{n\,[P_i'(\theta)]^2}{P_i(\theta)\,Q_i(\theta)} , \tag{35}$$

where $P_i'(\theta)$ is the derivative of $P_i(\theta)$ with respect to $\theta$. The subscript
$x$ has been dropped on the left of (35) because for tests composed of
equivalent items, the number-right score is a sufficient statistic for
estimating $\theta$ (Birnbaum, 1968, section 18.3.1); consequently, (35) represents
the maximum information that could be obtained from the responses to the $n$
items by any scoring method.

Since these $n$ items are ideally suited for an examinee at $\theta = 0$, it
follows that no information curve can ever be higher at any value of $\theta$ than
the value given by (35) when $\theta = 0$. This provides a horizontal line at or
below which all information curves must fall. Note that this limit is a
result of the assumption that $a_i$ is the same for all items. It might well be
found in practice that the harder items tend to have higher or lower $a_i$
than the easier items, in which case the limit would no longer be a horizontal
straight line. The limiting curve can still be computed from (35) if the
$a_i$ are known.
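
As a rough illustration of (35) and of the horizontal limit just described, the
following Python sketch evaluates the standard-test information function. It
assumes a normal-ogive icc with $c = 0$; the names `standard_test_information` and
`information_ceiling` are hypothetical, chosen only for this example.

```python
import math


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def standard_test_information(theta, n, a=1.0, b=0.0):
    """Eq. (35) for n equivalent normal-ogive items (c = 0) of difficulty b.

    Note: for very extreme theta, P or Q underflows to 0 in floating point
    and the ratio is no longer usable; this sketch does not guard against that.
    """
    z = a * (theta - b)
    p = normal_cdf(z)
    q = 1.0 - p
    dp = a * normal_pdf(z)          # P'(theta)
    return n * dp * dp / (p * q)


def information_ceiling(n, a=1.0):
    """Height of the horizontal limit: eq. (35) at theta = b, i.e. 2 n a^2 / pi."""
    return standard_test_information(0.0, n, a=a, b=0.0)
```

Under these assumptions the ceiling is $2na^2/\pi$, and the function is
proportional to $n$ at every $\theta$.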

It is worth mentioning that the information function for the number-right
score on any conventional test is proportional to $n$, the length of the
test. Thus, a percent increase in any information function, achieved by
whatever means, can be understood as an increase in information equivalent
to that obtained for a number-right score on a conventional test by increasing
the number of test items by the same percentage.


9.  Information Functions for the
Up-and-Down Method with
Final Difficulty Score

Even  when  we  restrict  attention  just  to  the  up-and-down  method  with 
final  difficulty  score,  there  is  still  quite  an  assortment  of  possibilities 
to  be  investigated.  Here,  we  restrict  our  attention  to  items  with  normal 
ogive  characteristic  curves  (for  some  results  under  the  logistic  model,  see 
section  13).  In  the  next  several  sections,  we  will  consider  only  the  case 
where  c  =  0  . 

This still leaves us with four parameters to take into consideration:
$a$, $n$, $d$, and $\theta$. Figure 2 shows information functions for the up-and-down
method with final difficulty score, computed from (34), (32), and (30),
for $n = 10$ and for $n = 60$. Figure 2 is appropriate for any value of $a_i$
in a wide range. This is possible because dividing $a_i$ by a constant does
not change the value of the icc in (3) (or in (1), (2), or (5), for that
matter) provided $b_i$ and $\theta$ are multiplied by the same constant.

If $a_i = a = 1$, we are likely to be interested in the range of $\theta$
between $\theta = -3/a = -3$ and $\theta = 3/a = +3$, say. Since we have set
$\sigma_\theta = 1$ (see section 4), we may expect that not too many people will lie
outside some range such as $\pm 3\sigma_\theta$. That part of Figure 2 beyond $a\theta = \pm 3$
is shown here for completeness; however, it has been shaded to indicate that
it is probably only rarely of interest.

If we consider items with $a_i = a = .50$ instead of 1.0, then, if other
things are the same as above, since $\sigma_\theta = 1$ always, we shall be interested
in the range from $\theta = -1.5/a = -3$ to $\theta = +1.5/a = +3$. When we
halve the value of $a_i$, we must double the value of $b_i$ (as well as of $\theta$)
if $P_i$ is to remain unchanged. This means that for $a_i = a = 1$, the
curve labeled $ad = 1.0$ represents a random walk procedure with step size
$d = 1.0/a = 1.0$; but when $a_i = a = 0.5$, the step size is now $d = 1/a = 2.0$.

Vertical distances in Figure 2 and in all similar figures are proportional
to the square root of the information function. Thus vertical distance is
inversely proportional to a large-sample approximation for the length of the
confidence interval for estimating $\theta$ from test score. The vertical
scale on the right, however, is numbered to show the information function,
not its square root. Thus the numbers given on the scale on the right
are proportional to the number of test items required to produce a
corresponding amount of information by conventional testing methods.

The two solid lines labeled $n = 10$ and $n = 60$ represent the
information functions of a 10-item and a 60-item standard test (see
section 8), as given by (35). If we read the heights of these curves
on the right-hand scale, we can confirm that the 60-item standard test gives
exactly 6 times as much information as the 10-item standard test, regardless
of the value of $\theta$.

The numerical results shown in Figure 2 and in subsequent figures were
obtained by programming a computer to evaluate equations (25), (28), and (32)
recursively for $v = 1, 2, \ldots, 59$. All results obtained recursively, both
here and in later sections, were independently checked up through $n = 10$
(thus checking the formulas also) by an entirely separate computer program
that computed the probability of each of the $2^{10}$ possible patterns of response
to $n = 10$ items, computed the score for each pattern, and then computed the
mean and the variance of these scores over all possible patterns, and also
the derivative $D_n(b)$. This latter method of computing results cannot be
extended much beyond $n = 10$ or 12.
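
The kind of independent check described above can be sketched as follows. This is
an illustration only, not the original program: it assumes a normal-ogive icc with
$c = 0$, covers the mean and variance of the final difficulty score but not the
derivative, and uses hypothetical function names.

```python
import itertools
import math


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def brute_force_moments(n, theta, a=1.0, d=0.5, b_start=0.0):
    """Mean and variance of the final difficulty over all 2**n response patterns."""
    mean = 0.0
    second = 0.0
    for pattern in itertools.product((0, 1), repeat=n):
        b = b_start
        prob = 1.0
        for u in pattern:
            p = normal_cdf(a * (theta - b))
            prob *= p if u else 1.0 - p      # probability of this response
            b += d if u else -d              # harder item after a right answer
        mean += prob * b
        second += prob * b * b
    return mean, second - mean * mean


def recursive_moments(n, theta, a=1.0, d=0.5, b_start=0.0):
    """The same two quantities from recursions (25)-(26) and (28)-(29)."""

    def rec(v, b):
        t = theta - b
        p = normal_cdf(a * t)
        q = 1.0 - p
        if v == 1:
            return d * (1.0 - 2.0 * q) - t, (d - t) ** 2 * p + (d + t) ** 2 * q
        e_up, w_up = rec(v - 1, b + d)
        e_dn, w_dn = rec(v - 1, b - d)
        return p * e_up + q * e_dn, p * w_up + q * w_dn

    bias, mse = rec(n, b_start)
    return bias + theta, mse - bias * bias
```

Both functions grow as $2^n$ here (the recursion is deliberately left unmemoized to
keep it short), which is consistent with the remark that enumeration is practical
only for small $n$.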

The  details  of  interpreting  Figure  2  are  considered  at  some  length 
here  because  most  subsequent  numerical  results  will  be  presented  similarly. 


Hereafter, the verbal interpretation will be for the most part limited to the
case where $a_i = a = .50$, in order to keep the presentation from getting
out of hand.

The scalloped effect when the step size is 2.0 (when $ad = 1.0$)
is due to the fact that when $n$ is even, the difficulty of the $(n + 1)$-th
item must be an odd multiple of 2.0. Thus examinees at $\theta = \pm 2.0$ or
at $\theta = \pm 6.0$ are measured more accurately than those at intermediate
levels. Actually, all curves shown here and in subsequent figures should
be scalloped. However, the effect is hardly noticeable for smaller values
of $d$ and has been ignored in drawing the figures.

When step size is $d = 2.0$, the amount of information obtained within
the range $-8 < a\theta < +8$ is not appreciably increased by additional testing
after the first 10 items; the same curve adequately represents the information
functions for both $n = 10$ and $n = 60$. Clearly the step size is too
large to provide accurate measurement. The same is true for $d = 1.0$
($ad = 0.5$) within the range $-3 < \theta < +3$ but not outside this range.

When step size is reduced to $d = 0.10$ ($ad = 0.05$), the information
obtained near $\theta = 0$ is greatly increased, less so for $\theta$ at $\pm 3.0$
($a\theta = \pm 1.5$).

For $a = 0.5$, the 10-item tailored testing procedure with $d = .10$ is better
than the standard test for almost all $\theta$. For $n = 60$, the standard test
is better at $\theta = 0$, but its effectiveness falls off, so that at $\theta = \pm 3.0$
($a\theta = \pm 1.5$), it provides only about 80 percent as much information as does the
tailored test with $d = 0.10$. We see here the broad outlines of a basic problem
in tailored testing. We need a small step size to compete with the accuracy
of measurement provided by the standard test for typical individuals (near
$\theta = 0$), but a larger step size for effective measurement
of atypical individuals (at $\theta = \pm 3$, say). The optimal step size in any
situation of course depends on $a_i$, on $n$, and on the accuracy of measurement
required at different levels of $\theta$.

Note that no information curve can ever be higher at any value of $\theta$
than the maximum of the information curve for the standard test.


10.  The Up-and-Down Method
with Number-Right Score

Keeping the same random-walk method of sequencing items, let us see to
what extent we get effective measurement when we change the examinee's score
from $b^{(n+1)}$, the difficulty of the $(n + 1)$-th item, to $\sum u_v$, the total
number of items answered correctly. This number-right score has been used
in experimental studies of tailored testing. Denoted by $X_n$ or by $x$,
it is the score referred to throughout this section, unless otherwise
specified. The reader may wish before reading further to form his own
judgment as to the relative effectiveness of the scores $b^{(n+1)}$ and $\sum u_v$.

Let $X_v(b,\theta)$ now denote the score $\sum_{r=1}^{v} u_r$ obtained in the up-and-down
method when the difficulty of the first item administered is $b$ and when
the ability of the examinee is $\theta$. As in section 7, we have a
recursion equation for the random variable $X$:

$$X_{v+1}(b,\theta) = \begin{cases} X_v(b + d,\theta) + 1 & \text{with probability } P(\theta - b) , \\ X_v(b - d,\theta) & \text{with probability } Q(\theta - b) , \end{cases} \tag{36}$$

for $v = 1, 2, 3, \ldots$. From this we can derive equations for $\mathcal{E}X_{v+1}(b,\theta)$,
$\mathcal{E}X_{v+1}^2(b,\theta)$, and for $D_{v+1}(b)$ similar to (25), (28), and (32). The
error variance $\sigma^2_{X_n|\theta}$ can be computed from the first line of (30). The
information function can be computed from (34).

We find that the bias, the mean squared error, and the sampling variance
of $\sum_{v=1}^{n} u_v$ are (not surprisingly) very different from the corresponding
quantities for $b^{(n+1)}$. But we find that the information functions for the
two scoring methods are exactly the same!

This result leads us to examine more closely the relation between
$b^{(n+1)}$ and $\sum u_v$. If an examinee starts at $b^{(1)} = 0$ and finishes at $b^{(n+1)}$
after $n$ steps each of length $d$, it is clear that for $d > 0$ his number of
right answers exceeds his number of wrong answers by $b^{(n+1)}/d$. Since the
number of right answers plus the number of wrong answers equals $n$, we have
for $d > 0$

$$\frac{b^{(n+1)}}{d} = \sum_{v=1}^{n} u_v - \Bigl(n - \sum_{v=1}^{n} u_v\Bigr)$$

or

$$b^{(n+1)} = d\Bigl(2\sum_{v=1}^{n} u_v - n\Bigr) . \tag{37}$$

In the up-and-down method there is a linear relationship between final
difficulty score and number-right score. Thus although the scores show very
different biases and sampling variances, they are equally effective for
measurement purposes.
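
Relation (37) holds pattern by pattern, whatever the response probabilities, so
it can be checked on arbitrary 0/1 response sequences. The short Python sketch
below does this; the function names and the use of random patterns are
illustrative assumptions, not part of the original study.

```python
import random


def up_and_down_scores(responses, d, b_first=0.0):
    """Return (final difficulty b^(n+1), number-right score) for one examinee."""
    b = b_first
    for u in responses:
        b += d if u == 1 else -d    # step up after a right answer, down after a wrong one
    return b, sum(responses)


def check_linear_relation(n=20, d=0.25, trials=1000, seed=0):
    """Check b^(n+1) = d * (2 * number_right - n) on random response patterns."""
    rng = random.Random(seed)
    for _ in range(trials):
        responses = [rng.randint(0, 1) for _ in range(n)]
        final_b, number_right = up_and_down_scores(responses, d)
        if abs(final_b - d * (2 * number_right - n)) > 1e-9:
            return False
    return True
```

Because one score is a linear function of the other, the slope and the standard
deviation in (34) are both rescaled by the same factor $2d$ in absolute value, and
the information function is unchanged.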


11.  Bioassay

In bioassay work, the average difficulty score $\frac{1}{n}\sum_{v=2}^{n+1} b^{(v)}$ is
commonly recommended for the up-and-down method. A considerable amount of
theoretical work has been done for this method: Tsutakawa (1967a, 1967b,
1963) was primarily concerned with asymptotic results. Wetherill (1963)
used Monte Carlo methods to investigate a wide variety of branching and
scoring methods empirically. Dixon (1965), Cochran and Davis (19^^),
Brownlee, Hodges and Rosenblatt (1953), Dixon and Mood (1948), and others
have derived useful large- and small-sample formulas. They also obtained
numerical results, mostly for $n = 10$ or 12.

The method of approach and most of the equations of section 12 are
simple extensions of those in the cited work of Brownlee, Hodges, and
Rosenblatt. This same general approach is used in several other sections,
including section 7.

It will be worthwhile to point out some of the similarities and
differences between our problem and the bioassay problem, as commonly
formulated. The bioassayist typically starts with equation (1) or (3),
just as we have here. Whereas we control $a_i$ and $b_i$ while trying to
estimate $\theta$, the bioassayist controls $\theta$ while trying to estimate the
value of $b$ and (sometimes) the value of $a$. For him $\theta$ might be the
dosage of the insecticide applied, for example, in which case $b$ would
be the LD50, the dosage at which 50 percent of the treated insects die.
The bioassayist chooses the dose $\theta^{(1)}$, administers it to one insect (or
to several), and observes the response: survival ($u = 0$) or death ($u = 1$).
He then chooses a dose, $\theta^{(2)}$, administers it to another insect, and
continues in this way.

Whereas we are usually only interested in the relative values of $\theta$
for different examinees, and often only in the rank order of these values,
the bioassayist must estimate the absolute value of $b$ for a single given
insecticide. Thus the bioassayist uses the mean squared error as a
criterion of effective estimation, whereas we use the information function.
Bias is a serious problem for the bioassayist, whereas it is usually of no
concern to us.

The fact that the bioassayist has two unknown parameters, $a$ and $b$,
creates a very serious problem. He must choose a step size $d$ without
knowing $a$. If he picks $d$ too large, the sampling error of his estimate of
$b$ will be excessive, even for sizable $n$. On the other hand, if $d$ is
small and $a$ happens to be small also, the true value of $b$ may be so far
from the value $\theta^{(1)}$ at which the bioassay is started that $\theta^{(v)}$ can
never reach the value $b$ in $n$ steps of length $d$. This results in an
unacceptable bias in the estimate of $b$. Without some knowledge of $a$, there
is no entirely safe way of choosing $d$. It is possible to estimate $a$ from
the observations themselves as they accumulate, but these estimates are very
unreliable for the values of $n$ frequently used in bioassay.

Most  work  on  the  up-and-down  method  assumes  that  a  is  known  or 
else  that  a  can  be  bounded  (from  previous  experience)  within  certain  limits. 
In  the  latter  case,  the  step  size  must  be  chosen  uncomfortably  high,  to 
allow for the possibility that a may be small. Here, we have assumed that
in mental testing the necessary item parameters have all been determined
by  pretesting,  with  good  accuracy.  The  result  of  all  this  is  that  we  will 
be  able  to  use  a  smaller  step  size  in  mental  testing  than  is  commonly 
recommended  for  the  up-and-down  method  in  bioassay.  This,  together  with 
the  fact  that  bias  is  usually  of  no  concern  to  us,  will  allow  us  to  obtain 
better  results  from  the  up-and-down  method  than  are  usually  possible  in 
bioassay.


12.  The Up-and-Down Method with
Average Difficulty Score


Dixon and Mood (1948), starting with the normal-ogive model for
bioassay, derived approximations to the maximum likelihood estimators for
$a$ and $b$. Brownlee, Hodges, and Rosenblatt (1953) proposed using
$\frac{1}{n}\sum_{v=2}^{n+1} \theta^{(v)}$ to estimate $b$ in the bioassay problem, pointing out that
this estimator is asymptotically equivalent to the one recommended by Dixon
and Mood. Our average difficulty score $\frac{1}{n}\sum_{v=2}^{n+1} b^{(v)}$ corresponds directly
to the estimator used by Brownlee, Hodges, and Rosenblatt. For the most part,
we will study exact small-sample properties of the average difficulty score
rather than asymptotic properties. The development given here for $H = L = 1$
is essentially
the same as that of Brownlee, Hodges, and Rosenblatt, except that they are
not concerned with $\partial Q_t/\partial\theta$, nor with quantities analogous to $D_v(b)$, nor
with the information function $I_x(\theta)$.

In this section, $x$ will always refer to the average difficulty score.
By definition, the random variable $X_v(b,\theta) \equiv \sum_{r=2}^{v+1} b^{(r)}$ will be the sum of
item difficulties obtained under the up-and-down method when the first item
is of difficulty $b$ and the examinee is at ability level $\theta$. The basic
recursion, corresponding to (20) for the final difficulty score, is seen
to be

$$X_{v+1}(b,\theta) = \begin{cases} b + Hd + X_v(b + Hd,\theta) & \text{with probability } P_t , \\ b - Ld + X_v(b - Ld,\theta) & \text{with probability } Q_t , \end{cases} \tag{38}$$

where $H = L = 1$ (the symbols $H$ and $L$ are not needed here, but will
be useful in later sections).

Since $X_v$ is a sum, not an average, let us define $E_v(b,\theta)$ by

$$E_v(b,\theta) \equiv \mathcal{E}X_v(b,\theta) - v\theta .$$

Then by the same reasoning used in section 7

$$E_{v+1}(b) = P_t E_v(b + Hd) + Q_t E_v(b - Ld) + E_1(b) , \tag{39}$$

$$E_1(b) = (Hd - t)P_t - (Ld + t)Q_t . \tag{40}$$

Similarly, the mean squared error is found from

$$W_{v+1}(b) = P_t W_v(b + Hd) + Q_t W_v(b - Ld) + W_1(b) + 2P_t(Hd - t)E_v(b + Hd) - 2Q_t(Ld + t)E_v(b - Ld) , \tag{41}$$

$$W_1(b) = (Hd - t)^2 P_t + (Ld + t)^2 Q_t . \tag{42}$$

Finally,

$$D_{v+1}(b) = D_1(b) + P_t D_v(b + Hd) + Q_t D_v(b - Ld) + [E_v(b - Ld) - E_v(b + Hd)]\,\frac{\partial Q_t}{\partial\theta} , \tag{43}$$

$$D_1(b) = -d(H + L)\,\frac{\partial Q_t}{\partial\theta} . \tag{44}$$


Fig.  4.  Information  functions  for  the  up-and-down  method  with  final 
difficulty  score. 


If the icc is a normal ogive with or without $c = 0$, as in (3) or (5),
then

$$\frac{\partial Q_i}{\partial\theta} = -a_i(1 - c_i)\,\varphi[a_i(\theta - b_i)] . \tag{45}$$
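
A compact way to see how (39) through (45) fit together is to evaluate them in one
recursive routine. The Python sketch below is illustrative only: it assumes common
item parameters `a` and `c` for all items, integer `H` and `L`, and the function
name `average_difficulty_moments` is hypothetical. It returns the mean, variance,
and slope of the average difficulty score together with the resulting value of (34).

```python
import math
from functools import lru_cache


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def average_difficulty_moments(n, theta, a=1.0, d=0.2, c=0.0, H=1, L=1, b_start=0.0):
    """Return (mean, variance, slope, information) for the average difficulty score.

    Assumptions: normal-ogive icc with common a and c; H and L are integers.
    """

    @lru_cache(maxsize=None)
    def rec(v, k):
        b = b_start + k * d
        t = theta - b
        p = c + (1.0 - c) * normal_cdf(a * t)
        q = 1.0 - p
        dq = -a * (1.0 - c) * normal_pdf(a * t)              # eq. (45)
        e1 = (H * d - t) * p - (L * d + t) * q               # eq. (40)
        w1 = (H * d - t) ** 2 * p + (L * d + t) ** 2 * q     # eq. (42)
        d1 = -d * (H + L) * dq                               # eq. (44)
        if v == 1:
            return e1, w1, d1
        e_up, w_up, s_up = rec(v - 1, k + H)
        e_dn, w_dn, s_dn = rec(v - 1, k - L)
        e = p * e_up + q * e_dn + e1                         # eq. (39)
        w = (p * w_up + q * w_dn + w1
             + 2.0 * p * (H * d - t) * e_up
             - 2.0 * q * (L * d + t) * e_dn)                 # eq. (41)
        s = d1 + p * s_up + q * s_dn + (e_dn - e_up) * dq    # eq. (43)
        return e, w, s

    e, w, s = rec(n, 0)
    mean = e / n + theta                  # the recursions work with the sum; divide by n
    variance = (w - e * e) / (n * n)
    slope = s / n
    return mean, variance, slope, slope * slope / variance
```

Dividing the sum by $n$ rescales the slope and the standard deviation by the same
factor, so the information value is the same for the sum and for the average.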

Figure 3 compares information curves for some final difficulty scores
(marked F) with those for the corresponding average difficulty scores (marked
A) both when $n = 10$ and when $n = 60$. When $n = 10$, the final difficulty
score with $ad = .05$ is best when $\theta = 0$; however, effectiveness has fallen
off when $a\theta = \pm 1.5$. The average difficulty score with $n = 10$, $ad = .50$
is not quite as good when $\theta = 0$, but is distinctly better when $a\theta = \pm 1.5$.

When $n = 60$ and $ad = .05$, the average difficulty score is better
than the final difficulty score throughout the range $-2 < a\theta < 2$. The
final difficulty score can be made to provide more information near $\theta = 0$
by shortening the step size (the curve for $ad = .025$ is shown). However, as
the step size is shortened, the information curve for the final difficulty
score must approach the curve for the standard test, shown as a solid line.
This is so because final difficulty score is perfectly correlated with
number-right score (eq. 37). Thus, shortening the step size produces only
small gains near $\theta = 0$ and ultimately leads to a serious loss of accuracy
for final difficulty score at more extreme values of $\theta$.

The average difficulty score with $ad = .20$ is better than any of the
other tailored procedures throughout the entire range shown. It is almost
as good as the standard test near $\theta = 0$ and is better for other values
of $\theta$. This result cannot be improved by shortening the step size. It
will be seen that when the average difficulty score is used, too short a
step size causes loss of accuracy throughout the entire range of $\theta$.


We conclude tentatively that for the up-and-down method, the average
difficulty score is preferable to the final difficulty score, at least
whenever we want good measurement throughout a range of $\theta$, not just at
$\theta = 0$. There may be some exceptions to this in the case of very short tests.
Our further investigations of the up-and-down method will be directed
principally at 60-item tests and will be based entirely on average difficulty
scores.

All the information curves in Figure 4 relate to the up-and-down method
with average difficulty score, except for the solid curves, which show the
information produced by the standard tests. When $n = 10$, the best step
size seems to be between $d = .2/a$ and $d = .5/a$. With this step size,
the tailored test is 85 percent efficient at $\theta = 0$ compared to the
standard test and is more efficient than the standard test for extreme
values of $\theta$. When $n = 60$, the best step size seems to be roughly
$d = .2/a$. With this step size, the tailored test is more than 90 percent
efficient at $\theta = 0$ compared to the standard test. At $a\theta = \pm 1.5$, the
standard test is only 48 percent efficient compared to this tailored test
(i.e., the standard test would have to be more than twice as long as the
tailored test in order to produce the same amount of information for
examinees at $\theta = \pm 1.5/a$).

It is no surprise that step size can be too small for effective
measurement of examinees at extreme values of $\theta$: $n$ cumulated steps may
never reach the item difficulty level appropriate for the examinee. It may
well seem surprising, however, that when average difficulty score is used,
step size can be too small for measuring examinees at $\theta = 0$, as shown in
Figure 4. After all, the most effective measurement for such examinees in
conventional testing is obtained when $b_i = 0$ for all items.

Consideration of a limiting case of the up-and-down method, the case
where the step size is zero, may throw some light on the apparent paradox.
If the step size is zero, all examinees will have the same average difficulty
score, regardless of their $\theta$. In this limiting case, it is clear that
number-right score can provide a good measure of $\theta$ whereas average difficulty
score cannot. This suggests that the effectiveness of average difficulty
score at $\theta = 0$ falls off when step size becomes too small, as illustrated
by Figure 4.


13.  A Comparison of Logistic and
Normal Ogive Item Characteristic Curves

Except for this section, all information functions given in this
report were calculated under the normal-ogive model given by (3) or (5).
Here, in Table 1, we compare information functions obtained from (3) for
normal-ogive icc's with those obtained from (1) for logistic icc's.

The numerical differences between the two models are not wholly
negligible. However, it does not seem necessary to compute all information
functions under both normal-ogive and logistic models.

14.  The  Effect  of  Chance  Success 

Common sense, and also empirical data, tells us that even examinees
at the lowest θ levels have a chance considerably greater than zero of
getting correct answers to multiple-choice questions. Clearly, this
will result from guessing, whether random or nonrandom (if there is such
a thing as "nonrandom guessing").

The three-parameter models (2) and (5) were designed to fit this
situation. The parameter c_i represents the probability of success for
low-level examinees. There is no need to specify any particular relation
between c_i and the number of possible responses to a multiple-choice
item. With 5-choice items, practical experience indicates that most items
will be fitted by values of c_i between .10 and .20. There is no need to
assume that all items in a test have the same c_i. However, for
simplicity, we here assume that all items have c_i = .20. We will
investigate whether or not this has any clear-cut implications for tailored
testing.


Table 1

A Comparison of Information Functions for
Normal-Ogive and Logistic Icc's

Step size  Test length              Information function I_x(θ) at
   ad          n         Icc        θ = 0     ±1      ±2      ±3
--------------------------------------------------------------------
  1.0         60       normal       5.45     5.45    5.40    5.37
                       logistic     5.41*    5.43    5.38    5.34
              10       normal       2.22     2.21    2.10    2.00
                       logistic     2.23     2.20    2.07    1.97
  0.05        60       normal       5.73     5.48    4.72    3.53
                       logistic     6.10     5.60    4.41    2.97
              10       normal       2.27     1.97    1.20    0.45
                       logistic     2.42     1.87    0.98    0.44

Figure 5 compares information functions for c_i = 0 with those for
c_i = .20 for all items. As always, the standard tests are shown by solid
lines. All other curves are for the up-and-down method with average
difficulty score, n = 60.

The figure confirms that chance success in answering items seriously
reduces measurement efficiency. The loss is, of course, greatest at low θ
levels, but in tailored testing it is substantial at all θ levels. The loss
in information at θ = 0 is 33 percent for the standard test, 38 percent for
ad = .05, 46 percent for ad = .20, and 68 percent for ad = 1.0. It
appears that the larger the step size, the greater the loss due to guessing.

The simple up-and-down method as described here is designed to move
towards a situation where about half the items administered to an examinee
will be answered correctly, about half incorrectly. This is the proper
ratio when there is no chance success, but it has long been recognized that
easier items are preferable when chance success occurs. This conclusion
is particularly obvious when c_i > .5, so that 50-percent success is at
or below the chance level. It used to be thought that for optimum
measurement the examinee should answer (1 + c)/2 of the items correctly. This
would be 60 percent of the items, for our case where c = .2.


Fig. 5. A comparison of information functions when there is chance
success (c = .2) and when there is not (c = 0).
Lord (1953, pp. 67-69) found that still easier items than this would
be preferable. A formula provided by Birnbaum (1968, eq. 20.4.22) for the
logistic model shows that for optimum measurement we should have

\[
b_i - \theta = -\frac{1}{1.7\,a_i}\,\log_e \frac{1 + \sqrt{1 + 8c_i}}{2} .
\tag{46}
\]

When a_i = .5 and c_i = .2, we find that b_i - θ should be -.314, in
which case, by (2), P_i(θ) = .653.
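
As a numerical check on (46), the following minimal Python sketch evaluates the optimal difficulty offset and the resulting probability of success. It assumes that the three-parameter logistic model (2) has the usual form P = c + (1 - c)/(1 + exp(-1.7a(θ - b))), with scaling constant 1.7; the function names are illustrative only and do not come from the report.

```python
import math

D = 1.7  # assumed logistic scaling constant


def optimal_offset(a, c):
    """Return b - theta that maximizes item information under the assumed model."""
    return -math.log((1.0 + math.sqrt(1.0 + 8.0 * c)) / 2.0) / (D * a)


def p_logistic(theta, a, b, c):
    """Assumed three-parameter logistic probability of a correct answer."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


offset = optimal_offset(a=0.5, c=0.2)
p_opt = p_logistic(theta=0.0, a=0.5, b=offset, c=0.2)
print(round(offset, 3), round(p_opt, 3))
```

Under these assumptions the two printed values agree with the -.314 and .653 quoted in the text. When c = 0 the offset is zero, which is the familiar case of items matched exactly to the examinee.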

How shall we arrange matters so that on a sufficiently long test the
examinee will eventually be answering 60 to 70 percent of the items correctly?
One possibility is to make the step size in the positive direction smaller
than the step size in the negative direction. We will investigate specifically
the branching rule that positive steps be of size Hd and negative steps be
of size Ld, for three cases: H = 2 and L = 3; H = 1 and L = 2; and
H = 1 and L = 3. Although these are still up-and-down methods, we will
also call them H-L methods. For convenience, we will speak of the
up-2-down-3 method, the up-1-down-2 method, etc.
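
A rough way to see where an H-L rule settles is to note that, in the long run, upward and downward movements must balance, so that P·H = Q·L, or P = L/(H + L). This balance condition is a heuristic added here, not a result stated in the report. The Python sketch below compares it with a simulated random walk; the three-parameter normal-ogive form of the success probability, the parameter values, and all function names are assumptions made for illustration.

```python
import math
import random


def p_correct(theta, b, a=0.5, c=0.2):
    """Assumed three-parameter normal-ogive probability of a correct answer."""
    phi = 0.5 * (1.0 + math.erf(a * (theta - b) / math.sqrt(2.0)))
    return c + (1.0 - c) * phi


def balance_proportion(H, L):
    """Proportion correct at which up-steps H*d and down-steps L*d cancel."""
    return L / (H + L)


def simulate_hl(theta, H, L, d, n, b1=0.0, seed=0):
    """Run one H-L random walk of n items and return the proportion correct."""
    rng = random.Random(seed)
    b, right = b1, 0
    for _ in range(n):
        if rng.random() < p_correct(theta, b):
            right += 1
            b += H * d      # harder item after a correct answer
        else:
            b -= L * d      # easier item after a wrong answer
    return right / n


for H, L in [(1, 1), (2, 3), (1, 2), (1, 3)]:
    print(H, L, round(balance_proportion(H, L), 3),
          round(simulate_hl(0.0, H, L, d=0.1, n=20000), 3))
```

A very long walk is used only so that the simulated proportion is stable; it says nothing about the efficiency of the methods for n = 60, which is what the information curves address.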

15. H-L Methods

Unless otherwise stated, subsequently reported results deal with the
case where chance success occurs, with c_i = .2 for all items.

The H-L methods are random-walking methods designed to administer
items at a difficulty level appropriate for the examinee. This should
reduce the asymmetry of information curves such as those shown for c = .2
in Figure 5.

The chance nature of chance success introduces a random element into
our measurements that necessarily must reduce their accuracy. We cannot
hope to regain the lost information by tinkering with item difficulty levels.
All we can hope to accomplish is to find a better way of determining
item difficulties than the simple up-and-down method of previous sections,
which we can now describe as the method with H = L = 1.

We will continue to use the average-difficulty method of scoring. The
necessary recursion equations are again (38) through (45), this time with
H and L free to assume any integer values.

For fixed a and for large n, the quantity d(H + L)/2 is in
practice roughly inversely proportional to the total number of items that
will be prepared and stored in the computer (see section 20). Figure 6
compares four H-L methods, each of which has d(H + L)/2 = .25/a. The
up-1-down-2 method and the up-1-down-3 method seem superior to the others.

The up-1-down-1 method is the simple up-and-down method discussed in
earlier sections. It is clearly inferior to the other H-L methods, all
of which tend to favor easier items.

To avoid crowding, curves with shorter and longer average step size are
not shown in the figure. On the basis of numerical comparisons, it was found
that substantially reducing the step size gave poorer measurement for
θ = +2, say, without improving matters at θ = 0. Increasing the step
size gave poorer measurement for -2 < θ < +2 without much gain outside
that range. Thus the curves shown seem to be near-optimal for the H-L methods.

16. Block Up-and-Down Methods

Let us consider other ways of modifying the simple up-and-down method
so as to administer somewhat easier items in situations where answers may
be correct because of chance success. If the score on a single item were
polychotomous instead of dichotomous, it might be easier to arrange an
up-and-down procedure under which the examinee will get about two-thirds of
the items right. This suggests combining items of equal difficulty into blocks.
After a block has been administered, the score on the block is used to
determine which block shall be administered next.

In bioassay, the block up-and-down method often has great advantages.
For example, it may be convenient to treat several insects at once rather
than one at a time. The blocking method has been investigated by
Tsutakawa (1967a, 1967b, 1963), by Wetherill (1963), by Cochran and Davis
(1964), and by Brownlee, Hodges, and Rosenblatt (1953). It is not clear
that blocking items adds any convenience in computerized testing. It does,
however, make possible more complicated branching processes than the usual
random walk.

Only one block up-and-down method is investigated here. The blocks
contain two items each. If the examinee answers both items at difficulty
level b_i correctly, the next block will have items of difficulty b_i + d.
If he answers only one correctly, the next two items will be at difficulty
level b_i. If he answers neither correctly, the next block will be at
difficulty level b_i - 3d.
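
The block rule just described can be sketched in a few lines of Python. The balance argument used here (an upward step d with probability P² must offset a downward step 3d with probability Q², giving P = √3/(1 + √3), or about .63) is a heuristic added for illustration and is not taken from the report; the normal-ogive form of the success probability and the parameter values are likewise assumptions.

```python
import math
import random


def p_correct(theta, b, a=0.5, c=0.2):
    """Assumed three-parameter normal-ogive probability of a correct answer."""
    phi = 0.5 * (1.0 + math.erf(a * (theta - b) / math.sqrt(2.0)))
    return c + (1.0 - c) * phi


def next_block_difficulty(b, n_right, d):
    """Rule from the text: +d after two right, unchanged after one, -3d after none."""
    if n_right == 2:
        return b + d
    if n_right == 1:
        return b
    return b - 3 * d


def simulate_block(theta, d, n_blocks, b1=0.0, seed=0):
    """Administer n_blocks two-item blocks and return the proportion correct."""
    rng = random.Random(seed)
    b, right = b1, 0
    for _ in range(n_blocks):
        r = sum(rng.random() < p_correct(theta, b) for _ in range(2))
        right += r
        b = next_block_difficulty(b, r, d)
    return right / (2 * n_blocks)


balance_p = math.sqrt(3.0) / (1.0 + math.sqrt(3.0))
print(round(balance_p, 3), round(simulate_block(0.0, d=0.1, n_blocks=20000), 3))
```

The simulated proportion correct should lie close to the heuristic value, which is in the neighborhood of the two-thirds mentioned above.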

If average-difficulty score is used, the basic recursion equation for this
method is

\[
X_{v+2}(b,\theta) =
\begin{cases}
2b + 2d + X_v(b + d,\theta) & \text{with probability } P_t^2 , \\
2b + X_v(b,\theta) & \text{with probability } 2P_tQ_t , \\
2b - 6d + X_v(b - 3d,\theta) & \text{with probability } Q_t^2 ,
\end{cases}
\tag{47}
\]

where the random variable X_v(b,θ) = Σ_{r=2}^{v+1} b^(r) is the sum of the
b_i values of the items administered to an examinee at ability level θ under
the specified block up-and-down method with b^(1) = b. The other necessary
equations can be derived from (47) but will not be written out here.

Results of the calculations showed this particular method to be inferior
to a variety of the simple H-L methods described in section 15. Many other
block methods could be tried out. This has not been done here, however.

17.  Plicate  Methods 

When c ≠ 0, we need to produce an asymmetry, so that the examinee
is more likely to give a right answer than a wrong one. An obvious device
is to rule that a correctly answered item need not always be followed, as
in the simple H-L methods, by a more difficult item.

Here  we  investigate  a  two-ply  method,  defined  as  follows.  Whenever 
the  examinee's  number-right  score  on  the  items  already  administered  is  an 
odd  number,  the  next  item  is  assigned  by  the  H-L  method  with  H  =  1  and 
L  =  1  (this  is  the  simple  up-and-down  method).  Whenever  the  examinee's 
number-right  score  on  the  items  already  administered  is  an  even  number, 
we  assign  the  next  item  by  setting  H  =  0  and  L  =  1  . 

We will also investigate a three-ply method: when the examinee's
number-right score is not a multiple of three, H = 1 and L = 1; when
it is a multiple of three, H = 0 and L = 1. These and other similar
methods will be called plicate methods. The examinee's final score need
not be his number-right score. Here in all cases his final score will be
his average difficulty score.
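
The plicate selection rules can be sketched as follows. The sketch reads "the items already administered" as including the item just answered, which is an interpretation rather than something the text states explicitly; the function names and the response pattern are hypothetical.

```python
def plicate_next_difficulty(b, correct, number_right, d, ply=2):
    """Difficulty of the next item under a two-ply or three-ply rule.

    number_right counts correct answers so far, including the current item
    (an assumed reading of the rule). H = 0 when that count is a multiple
    of ply, otherwise H = 1; L = 1 throughout.
    """
    H = 0 if number_right % ply == 0 else 1
    return b + H * d if correct else b - d


def administer(responses, d, b1=0.0, ply=2):
    """Return the difficulties b(1), ..., b(n+1) for a fixed list of 0/1 responses."""
    difficulties, b, right = [b1], b1, 0
    for u in responses:
        right += u
        b = plicate_next_difficulty(b, bool(u), right, d, ply)
        difficulties.append(b)
    return difficulties


path = administer([1, 1, 0, 1, 1], d=0.5)
score = sum(path[1:]) / (len(path) - 1)   # average difficulty, omitting b(1)
print(path, score)
```

With this reading, every second correct answer under the two-ply rule leaves the difficulty unchanged, which is what pushes the examinee toward easier items than the simple up-and-down method would give.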

The  necessary  basic  equations  for  the  two-ply  method,  again  derived 
by  the  same  line  of  reasoning  used  in  sections  12  and  7,  will  be  given. 

The  extension  to  the  three-ply  and  to  other  patterns  is  straightforward. 

The average item difficulty

\[
\bar{X} = \frac{1}{n} \sum_{v=2}^{n+1} b^{(v)}
\]

is the examinee's score. The random variable
X_v(b,θ) = Σ_{r=2}^{v+1} b^(r) is the sum of the item difficulties when the
two-ply method is used with b^(1) = b, the examinee being at ability level θ.

To start with, we will need to consider the alternate two-ply method
in which H = 0 and L = 1 when the number-right score is odd, and H = 1
and L = 1 when the number-right score is even. A complete set of
equations will be needed for this "alternate" pattern as well as for the
original pattern. A prime will be attached to all quantities computed under
the alternate pattern. Thus the random variable X'_v(b,θ) is the sum
Σ_{r=2}^{v+1} b^(r) obtained by an examinee at ability level θ when the
alternate pattern is used to pick the items administered.

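The basic recursions are given by two formulas:

\[
X_{v+1}(b,\theta) =
\begin{cases}
b + d + X'_v(b + d,\theta) & \text{with probability } P_t , \\
b - d + X_v(b - d,\theta) & \text{with probability } Q_t ;
\end{cases}
\tag{48}
\]

\[
X'_{v+1}(b,\theta) =
\begin{cases}
b + X_v(b,\theta) & \text{with probability } P_t , \\
b - d + X'_v(b - d,\theta) & \text{with probability } Q_t .
\end{cases}
\tag{49}
\]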


We can halve the number of equations to be written down by using the
superscripts * and o with the understanding that either one of these is
to be omitted while the other is to be replaced by a prime. Then, letting
H = 1 and H' = 0, the following equation can represent both (48) and (49):

\[
X^{\circ}_{v+1}(b,\theta) =
\begin{cases}
b + H^{\circ}d + X^{*}_v(b + H^{\circ}d,\theta) & \text{with probability } P_t , \\
b - d + X^{\circ}_v(b - d,\theta) & \text{with probability } Q_t .
\end{cases}
\tag{50}
\]

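From this we can derive:

\begin{align}
E^{\circ}_{v+1}(b) &= E^{\circ}_1(b) + P_t E^{*}_v(b + H^{\circ}d) + Q_t E^{\circ}_v(b - d) , \tag{51} \\
E^{\circ}_1(b) &= (H^{\circ}d - t)P_t - (d + t)Q_t , \tag{52} \\
W^{\circ}_{v+1}(b) &= W^{\circ}_1(b) + P_t W^{*}_v(b + H^{\circ}d) + Q_t W^{\circ}_v(b - d) \notag \\
 &\quad + 2P_t(H^{\circ}d - t)E^{*}_v(b + H^{\circ}d) - 2Q_t(d + t)E^{\circ}_v(b - d) , \tag{53} \\
W^{\circ}_1(b) &= (H^{\circ}d - t)^2 P_t + (d + t)^2 Q_t , \tag{54} \\
D^{\circ}_{v+1}(b) &= D^{\circ}_1(b) + P_t D^{*}_v(b + H^{\circ}d) + Q_t D^{\circ}_v(b - d) \notag \\
 &\quad + [E^{\circ}_v(b - d) - E^{*}_v(b + H^{\circ}d)] \frac{dQ_t}{d\theta} , \tag{55} \\
D^{\circ}_1(b) &= -d(1 + H^{\circ}) \frac{dQ_t}{d\theta} . \tag{56}
\end{align}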
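
To illustrate how recursions of this kind are used, the Python sketch below evaluates (51) and (52) for the two-ply method and checks the result against a direct enumeration of all response patterns. It interprets E_v(b) as the expected value of X_v(b,θ) - vθ and t as θ - b, which is consistent with (52) but is an inference; the three-parameter normal-ogive form of P_t, the parameter values, and the function names are assumptions.

```python
import math
from functools import lru_cache
from itertools import product

A, C_GUESS, STEP, THETA = 0.5, 0.2, 0.5, 0.3   # illustrative values only
H = {False: 1, True: 0}   # H for the original pattern, H' for the primed pattern


def p_correct(t):
    """Assumed three-parameter normal ogive, with t = theta - b."""
    return C_GUESS + (1.0 - C_GUESS) * 0.5 * (1.0 + math.erf(A * t / math.sqrt(2.0)))


@lru_cache(maxsize=None)
def expected_sum(v, j, primed=False):
    """E[X_v - v*theta] for first-item difficulty b = j*STEP, via (51) and (52)."""
    if v == 0:
        return 0.0
    t = THETA - j * STEP
    p = p_correct(t)
    q = 1.0 - p
    h = H[primed]
    e1 = (h * STEP - t) * p - (STEP + t) * q                      # equation (52)
    return (e1 + p * expected_sum(v - 1, j + h, not primed)       # equation (51)
            + q * expected_sum(v - 1, j - 1, primed))


def brute_force(n):
    """The same expectation, obtained by enumerating all 2**n response patterns."""
    total = 0.0
    for pattern in product((0, 1), repeat=n):
        j, primed, prob, x = 0, False, 1.0, 0.0
        for u in pattern:
            p = p_correct(THETA - j * STEP)
            prob *= p if u else 1.0 - p
            if u:
                j, primed = j + H[primed], not primed   # success switches pattern
            else:
                j -= 1                                   # failure keeps pattern
            x += j * STEP - THETA
        total += prob * x
    return total


n = 8
mean_score = THETA + expected_sum(n, 0) / n   # expected average difficulty score
print(mean_score, brute_force(n) - expected_sum(n, 0))
```

The recursion visits only a number of states of order n², whereas the enumeration visits 2^n patterns; this is why recursions remain feasible for n = 60 while enumeration does not.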


Figure  7  compares  the  best  results  obtained  with  the  H-L  method,  the 
two-ply  method,  and  the  three-ply  method.  The  three-ply  curve  for  ad  =  .2 
is  almost  the  same  as  the  two-ply  curve  shown,  but  slightly  lower.  It 
appears  that  the  two-ply  method  is  slightly  superior  when  c  =  .2  to  the 
three-ply  method.  The  main  conclusion  is  that  all  the  methods  shown  in  the 
figure  are  almost  equally  good  when  the  proper  step  size  is  used. 


Figure 7 has been shaded for θ > 2 and θ < -1 to call attention
to a way of improving measurement that has not been mentioned up to now.
Given the items (the a_i, b_i, and c_i), the information function does
not depend at all on the nature of the group tested. In particular, the
information curve depends on θ and b^(1) only through their difference
θ - b^(1). This has been indicated in Figure 7 by labelling the base
line a(θ - b^(1)). In looking at earlier results (see discussion in section
11), we have frequently evaluated them for a group of examinees in the
range -1.5 < a(θ - b^(1)) < +1.5. Suppose we keep the same group of
examinees and the same tailored testing procedure except that the first
item administered is taken at difficulty level b^(1) = -0.5 instead
of at b^(1) = 0. The examinees now fall in the interval -1 < a(θ - b^(1)) < +2.
Figure 7 shows that we get better measurement in this interval than in the
other one. The information function is nearly the same from one end of
the interval to the other. This is the result achieved by choosing the
first item administered at an easier difficulty level than we have used up
to now. The two-ply method with step size d = .5/a is 86 percent efficient
at θ - b^(1) = .5/a compared to the standard test. That is, it produces as much
information with n = 60 items as would a standard test, b_i = 0, with
n = 52 items. Here we have assumed, as in the last several sections, that
all icc are normal ogives and that all c_i = 0.2. Except at θ - b^(1) = .5/a
the two-ply test is more than 86 percent efficient compared to the standard
test. At θ - b^(1) = -1/a the standard test is only 49 percent efficient
compared to the two-ply test. At θ - b^(1) = +2/a, the standard test is
only 50 percent as efficient.
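
The conversion between relative efficiency and equivalent test length used above is simple arithmetic, sketched here in Python. It assumes, as the text does implicitly, that the information supplied by a test of a given design is proportional to its length; the function name is illustrative.

```python
def equivalent_length(n, relative_efficiency):
    """Length of a comparison test giving the same information as n reference items."""
    return n * relative_efficiency


# A 60-item two-ply test that is 86 percent efficient relative to the
# standard test matches a standard test of about 52 items.
print(equivalent_length(60, 0.86))

# Where the standard test is only 49 percent efficient relative to the
# two-ply test, it would need roughly twice as many items to match it.
print(equivalent_length(60, 1 / 0.49))
```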
The standard test is a "peaked" test with all items of equal difficulty,
b_i = 0. This produces unusually accurate measurement around one value of
θ (near θ - b_i = 0.3/a in our case) at the expense of less accurate
measurement at extreme values of θ. The typical published test has a wide range of
b_i values (partly because it is easier to produce such a test and partly
because most people believe a range of item difficulty is necessary to
secure good measurement. Actually, for typical groups and typical
values of a_i, the peaked test would usually provide better measurement
for all except the most extreme examinees in the group tested, as
is illustrated in Figure 7). Actual b_i values, estimated for a well-known
published test, were used to compute the information curve labelled
"published test" in Figure 7. The curve shown was computed from Birnbaum's
equation (1968, eq. 20.2.2)

\[
I(\theta, x) = \frac{\left[ \sum_{i=1}^{n} P_i'(\theta) \right]^2}{\sum_{i=1}^{n} P_i(\theta)\,Q_i(\theta)}
\]

under the assumption that a_i was the same for all items and also that
c_i = .20 (the actual published test does not satisfy this
assumption). The 60 values of b_i ranged from -2.5 to +1.5. We
note that the tailored tests all measure more accurately at all values of θ
than does such an unpeaked test.
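
The information function of the number-right score can be evaluated directly, as in the Python sketch below. The sketch assumes the three-parameter normal-ogive form P_i(θ) = c + (1 - c)Φ(a(θ - b_i)); the evenly spaced difficulties of the "spread" test are hypothetical stand-ins covering the range -2.5 to +1.5 and are not the b_i values of the published test.

```python
import math


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def number_right_information(theta, bs, a=0.5, c=0.2):
    """[sum of P_i'(theta)]**2 / sum of P_i(theta) * Q_i(theta) for number-right score."""
    slope_sum = 0.0
    variance_sum = 0.0
    for b in bs:
        p = c + (1.0 - c) * normal_cdf(a * (theta - b))
        slope_sum += (1.0 - c) * a * normal_pdf(a * (theta - b))   # P_i'(theta)
        variance_sum += p * (1.0 - p)                              # P_i * Q_i
    return slope_sum ** 2 / variance_sum


n = 60
peaked = [0.0] * n                                        # standard test, all b_i = 0
spread = [-2.5 + 4.0 * i / (n - 1) for i in range(n)]     # hypothetical b_i values

for theta in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(theta,
          round(number_right_information(theta, peaked), 2),
          round(number_right_information(theta, spread), 2))
```

Printing the two columns side by side shows how the comparison between a peaked and an unpeaked conventional test can be made at each θ level separately rather than through a single group statistic.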


The theory up to this point has assumed that items were available at
whatever difficulty level was required by the random walk procedure. In
principle, this would require for the H-L or plicate methods a total of
n(n + 1)/2 items, regardless of step size. When c = 0, there is usually a
vanishingly small probability of needing items with |b_i| > 5, say. If
we supply all the items that theoretically might be required in the range
-5 < b_i < 5 and no items outside this range, the simple up-and-down method
will require, depending on the details of the random walk, about
(1 + 5/d)(n - 5/2d) items, assuming d is a submultiple of 5. For sizable
n, this quantity is roughly inversely proportional to the step size d. For
n = 60, d = .50, the number of items required would be about 600.
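
The two item counts quoted above can be reproduced with a short Python sketch; the function names are illustrative, and the second function simply evaluates the approximate expression given in the text.

```python
def pool_size_unrestricted(n):
    """Items needed in principle when every reachable difficulty is stocked."""
    return n * (n + 1) // 2


def pool_size_bounded(n, d, bound=5.0):
    """Approximate count (1 + bound/d)(n - bound/(2d)) for difficulties within the bound."""
    return (1 + bound / d) * (n - bound / (2 * d))


print(pool_size_unrestricted(60))     # 1830
print(pool_size_bounded(60, 0.50))    # 605.0, i.e. about 600
```

Halving the step size to d = .25 roughly doubles the bounded count, in line with the statement that the requirement is about inversely proportional to d.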

Presumably  this  number  could  be  considerably  reduced  without  much  loss 
of  accuracy  of  measurement.  Whenever  items  are  in  short  supply  (many 
shortages  should  disappear,  given  enough  time)  practical  applications 
of  tailored  testing  will  have  to  consider  which  short-cuts  will  cause  the  least 
loss  of  efficiency.  Much  light  could  be  thrown  on  this  question  by 
Monte  Carlo  studies  of  different  possible  procedures. 

It  would  be  desirable  to  assign  a  cost  to  items,  a  cost  to  computer 
storage  space,  and  a  loss  function  to  errors  of  measurement.  One  could 
then  investigate  the  economic  efficiency  of  various  tailored  testing 
procedures  under  various  cost  conditions.  Something  of  this  sort  will  have 
to  be  done,  whether  formally  or  informally,  before  tailored  testing  comes  into 
widespread  use.  Nothing  like  this  has  been  attempted  for  the  present  report. 

19.  Robbins-Monro  Processes 

It will have occurred to the reader that a large step size is needed
for the first few items so that the item difficulty can be rapidly adjusted
to suit an extreme examinee. Once this has been done, progressively
smaller step sizes are needed in order to zero in on the best value.
Probably considerable gain could be achieved simply by choosing d = 1,
say, for the first three or four items and thereafter proceeding with a
smaller but fixed d, by any of the methods of the preceding sections.
This rule has not yet been investigated for tailored testing.

Carrying this idea to its logical conclusion leads to a Robbins-Monro
type of process. Let P(b,θ) denote any monotonic increasing item
characteristic curve satisfying certain regularity assumptions. An important
advantage of the method is that we do not have to know the mathematical
form of P(b,θ). We wish to find the value of b for which P(b,θ) equals
some chosen constant, α (under the normal-ogive model of equation 3, α is
usually chosen as 1/2, since P(θ,θ) = 1/2).

In a Robbins-Monro process, we choose a decreasing sequence of positive
constants A_1, A_2, ..., to be discussed later. We choose a trial value
of b for the first item; let this be b^(1) = 0. We administer the
item and observe the response u_1 = 0 or 1. Then we choose all
subsequent values of b^(v) by the rule

\[
b^{(v+1)} = b^{(v)} + A_v (u_v - \alpha) .
\tag{57}
\]

This leads to an H-L method with continually shrinking step size. Robbins and
Monro (1951) showed that under suitable conditions b^(v) converges in
probability to θ as v → ∞, that is, b^(v) is a consistent estimator
of θ.

Chung (1954) and Hodges and Lehmann (1956) showed that under certain
conditions, satisfied for our problem, if the A_v are determined from

\[
A_v = C/v ,
\tag{58}
\]

where C is a constant to be determined, the asymptotic sampling variance
of b^(v) will be minimized. Hodges and Lehmann (1956, section 5) show that
for our kind of quantal response problem, the value of C minimizing the
asymptotic sampling variance is

\[
C = \frac{1}{\left| \partial P(b,\theta) / \partial b \right|_{b=\theta}} .
\tag{59}
\]

Of course, this value cannot be determined without knowing the nature of
P(b,θ). In our case, this optimal value for the normal ogive model of
(5) is

\[
C = \frac{\sqrt{2\pi}}{a_i (1 - c_i)} .
\tag{60}
\]

If c_i = 0.2 and a_i = 1.0, then the initial step size under (58)
and (60) is A_1 = C = 3.1. If a_i = .333, then A_1 = 9.4; if a_i = .10,
then A_1 = 31.3; etc.
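
The step sizes just quoted, and the behavior of rule (57) with A_v = C/v, can be illustrated with the Python sketch below. It assumes the three-parameter normal-ogive model in the form P(b,θ) = c + (1 - c)Φ(a(θ - b)) and takes the target constant as α = c + (1 - c)/2, so that P(θ,θ) = α; both choices are assumptions made for this sketch, as are the function names.

```python
import math
import random


def optimal_gain(a, c):
    """C of equation (60): sqrt(2*pi) / (a * (1 - c))."""
    return math.sqrt(2.0 * math.pi) / (a * (1.0 - c))


def p_correct(theta, b, a, c):
    """Assumed three-parameter normal-ogive probability of a correct answer."""
    return c + (1.0 - c) * 0.5 * (1.0 + math.erf(a * (theta - b) / math.sqrt(2.0)))


def robbins_monro(theta, n, a=1.0, c=0.2, b1=0.0, seed=0):
    """Apply rule (57) with A_v = C / v for n items; return the final difficulty."""
    rng = random.Random(seed)
    alpha = c + (1.0 - c) / 2.0          # assumed target: P(theta, theta)
    gain = optimal_gain(a, c)
    b = b1
    for v in range(1, n + 1):
        u = 1 if rng.random() < p_correct(theta, b, a, c) else 0
        b += (gain / v) * (u - alpha)
    return b


for a in (1.0, 1.0 / 3.0, 0.10):
    print(round(a, 3), round(optimal_gain(a, 0.2), 1))   # initial step size A_1 = C

print(robbins_monro(theta=1.0, n=2000))
```

The very long run in the last line is meant only to display the consistency property; it is not a statement about the accuracy obtainable with a test of practical length.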

However, we have been assuming that almost all our examinees fall
within some roughly known interval centered about the origin of our θ
scale. If we have such information (obtained from previous testings) and
if we are willing to forego accurate measurement of the very rare examinee
who falls outside the interval, then we probably do not need to use an
initial step size greater than about half the width of the interval.

It would be very desirable to obtain information functions for the test
scores b^(n+1) produced by the stochastic approximation method outlined.
Unfortunately, there appears to be no available method of doing this
precisely for a test longer than a dozen or so items. Recursion formulas
cannot be worked out by the methods used in earlier sections.

Asymptotic formulas are available, but these would produce information
functions that are horizontal straight lines. Previous sections indicate
that optimum step size is found by reducing step size as far as possible
without making the information curve fall off too sharply on both sides of
the maximum. Asymptotic results would be useless for this purpose.

The only good way to deal with this problem would seem to be by
using Monte Carlo procedures. For various reasons, we have not used any
Monte Carlo procedures for this study. Wetherill (1965) made extensive
Monte Carlo investigations of the Robbins-Monro method from the bioassay
point of view. He found that the method was not sensitive to mischoice of
C. The method was "extremely satisfactory" in the case where
P(b^(n), θ) → 1/2 as n becomes large (we have this case when c_i = 0);
but "of little use," "unsuitable," "hopeless" in the case where
P(b^(n), θ) → .75 (we have this case when c_i = .5).

The main trouble in this latter case (P(b^(n), θ) → .75) is the large
bias produced by the Robbins-Monro method. As already pointed out in
section 11, bias is a serious matter in bioassay work, but usually of no
concern in mental testing. For this reason, and for other reasons indicated
in section 11, it is difficult to apply many of Wetherill's reported
results to the purposes of the present study. [Wetherill also tried out many
promising up-and-down methods, which could not be evaluated by the
techniques used for this study. These could and should be investigated
for tailored testing purposes by use of Monte Carlo methods.]
The use of (58) to decide on step size is practical in bioassay, but is
not practical in tailored testing. In principle, use of (58) would require that
almost 2^n different items be available to the computer, an impossible
requirement if n = 60. An obvious modification is to classify all items on
b_i into class intervals of width d, and then pick a sequence of class intervals
corresponding as closely as possible to the sequence specified by (57).

This modification destroys all the asymptotic virtue of the Robbins-Monro
process, since the use of fixed class intervals prevents convergence
of b^(n) to θ as n → ∞. However, this difficulty need not impair
the information actually produced by a test of fixed length. If a psychometrician
is going to administer a test of fixed length n, there is no good
reason why he should use a consistent estimator of θ.

The way of choosing items outlined above was tried out for this
study for short tests with n = 10 and c_i = .20. With n = 10, it was
possible for the computer to deal individually with each of the possible
patterns of examinee response. Average difficulty score was used, not
final difficulty score, as in the standard Robbins-Monro process.
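
A calculation of this kind can be sketched in Python as follows. Everything specific in the sketch is an assumption made for illustration: the three-parameter normal-ogive model, the rounding of each difficulty called for by (57) to the nearest class interval, the value of the constant C (called gain below), and the use of the squared slope of the mean score divided by the score variance as the information function. It is not the program used for the report.

```python
import math
from itertools import product


def p_correct(theta, b, a=0.5, c=0.2):
    """Assumed three-parameter normal-ogive probability of a correct answer."""
    return c + (1.0 - c) * 0.5 * (1.0 + math.erf(a * (theta - b) / math.sqrt(2.0)))


def score_moments(theta, n=10, gain=6.0, d=0.5, a=0.5, c=0.2, b1=0.0):
    """Exact total probability, mean, and variance of the average difficulty
    score, found by enumerating all 2**n response patterns."""
    alpha = c + (1.0 - c) / 2.0
    total = mean = second = 0.0
    for pattern in product((0, 1), repeat=n):
        b, prob, x = b1, 1.0, 0.0
        for v, u in enumerate(pattern, start=1):
            p = p_correct(theta, b, a, c)
            prob *= p if u else 1.0 - p
            # rule (57) with A_v = gain / v, moved to the nearest class interval
            b = d * round((b + (gain / v) * (u - alpha)) / d)
            x += b
        score = x / n
        total += prob
        mean += prob * score
        second += prob * score * score
    return total, mean, second - mean * mean


def information(theta, h=1e-3, **kwargs):
    """Assumed information function: (d mean / d theta)**2 / variance."""
    _, hi, _ = score_moments(theta + h, **kwargs)
    _, lo, _ = score_moments(theta - h, **kwargs)
    _, _, var = score_moments(theta, **kwargs)
    return ((hi - lo) / (2.0 * h)) ** 2 / var


for theta in (-2.0, 0.0, 2.0):
    print(theta, round(information(theta), 3))
```

With n = 10 there are only 1,024 patterns, so the enumeration is exact and immediate; the count doubles with each added item, which is why the same approach is out of reach for n = 60.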

No  choice  of  C  was  found  that  yielded  information  curves  as  good  as 
those  obtained  by  the  better  up-and-down  methods.  Detailed  results  will 
not  be  given  here. 

Another possibility for improving the estimation of θ is to use a
weighted rather than an unweighted average of item difficulties for the
examinee's score. The difficulty b^(1) of the first item has always been
omitted from the average difficulty score (19) since it is the same for
all examinees and thus cannot carry any information about θ. If step
size is small, the same objection applies to a lesser extent to b^(2).
Clearly, the item difficulties for the earlier items are not expected to
be as close to θ as those for the later items. This suggests that
the later items should receive more weight in determining final score than
the earlier items.

Two systems of weighting were tried out here:

\[
x = \sum_{v=2}^{n+1} \sqrt{v}\, b^{(v)} ,
\tag{61}
\]

\[
x = \sum_{v=2}^{n+1} v\, b^{(v)} .
\tag{62}
\]

The weighted score (61) with weights proportional to the square root of
the item serial number v was found to be a little better than the
unweighted average difficulty score. The weighted score (62) with weights
proportional to item serial number was still better than (61).
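
The three scores being compared can be written compactly, as in the Python sketch below. The sequence of difficulties is a hypothetical example, the function name is illustrative, and the sums are assumed to run over b^(2) through b^(n+1), as for the average difficulty score.

```python
import math


def weighted_score(difficulties, weight):
    """Sum of weight(v) * b(v) over v = 2, ..., n+1.

    difficulties holds b(1), ..., b(n+1); b(1) is skipped because it is
    the same for all examinees.
    """
    return sum(weight(v) * b for v, b in enumerate(difficulties[1:], start=2))


path = [0.0, 0.5, 1.0, 0.5, 1.0]     # hypothetical b(1), ..., b(5) for n = 4

unweighted = weighted_score(path, lambda v: 1.0) / (len(path) - 1)
sqrt_weighted = weighted_score(path, math.sqrt)      # weights as in (61)
linear_weighted = weighted_score(path, float)        # weights as in (62)
print(unweighted, sqrt_weighted, linear_weighted)
```

Dividing a weighted sum by the sum of its weights would turn it into a weighted average; since that divisor is the same for every examinee, it does not affect the information function.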

It would be desirable to have results for n = 60 as well as for
n = 10. In view of the incomplete nature of presently available results,
no information functions will be displayed here. There is clearly a need
for extensive Monte Carlo studies to investigate further the methods already
outlined and the numerous possible recombinations and mutations of these
methods.

20. Summary and Conclusions

When computers are used extensively for instruction, it will be
convenient to use them to administer tests for measurement purposes as
well as for instructional purposes. We restrict attention here to tests
used for measurement. Given a supply of pretested items, the computer can
individually design a test for each individual examinee. How can it secure
the most accurate measurement?

The question falls into two parts: What items shall be administered
to a particular examinee? How shall his responses be scored? We do not
find optimum procedures. We compare various simple procedures and answer
a number of primitive but previously unanswered questions.

A number of empirical studies of tailored tests have been carried out
using an external criterion to evaluate the results. Here we have no external
criterion. Our purpose is to measure, not to predict.

The examinee can be measured by the following simple branching rule:
Administer a harder item after each correct answer, an easier item after each
wrong answer. Start with large changes in item difficulty (b), then use
smaller and smaller steps, so as to zero in on the difficulty level at which
the examinee answers correctly 50 percent (say) of the time. This final
difficulty level, b^(n+1), is a measure of the examinee's standing on the
trait measured by the test.

This is the Robbins-Monro stochastic approximation process.
Asymptotically optimum details for the procedure are known. To investigate
its efficiency in subasymptotic situations requires Monte Carlo methods.
We do not use these methods here.

Instead,  we  investigate  various  branching  rules  that  do  not  shrink 
the  step  size  as  testing  progresses.  We  also  investigate  three  different 
scoring  methods.  We  evaluate  the  efficiency  of  these  procedures  for  short, 
medium,  or  long  tests,  as  desired.  Our  problems  are  found  to  be  very 
closely  related  to  certain  problems  in  bioassay.  Many  important  results 
obtained  for  the  bioassay  problems  are  of  direct  use  to  us  here. 

To keep matters simple, we assume that available items differ only
in difficulty ($b_i$). In order to compare the efficiency of various
branching and scoring procedures, we often assume that the probability of
a correct answer to an item is a normal ogive function of $\theta$, the examinee's
standing on the trait measured. Alternatively, a logistic function is assumed.
Both assumptions are generalized to cover the case where correct answers
may be due to guessing.
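The two response models just named can be written out as a short sketch. The guessing generalization is assumed here to take the common form c + (1 - c)F, with c the chance-success level; that form, the scaling constant 1.7 used to bring the logistic close to the normal ogive, and the function names are assumptions for illustration rather than formulas quoted from this section.

```python
import math

def normal_cdf(x):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_ogive(theta, b, a=1.0, c=0.0):
    """Assumed form: P(correct) = c + (1 - c) * Phi(a * (theta - b))."""
    return c + (1.0 - c) * normal_cdf(a * (theta - b))

def logistic(theta, b, a=1.0, c=0.0, scale=1.7):
    """Assumed form: P(correct) = c + (1 - c) / (1 + exp(-scale * a * (theta - b)))."""
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))
```

With c = 0 both curves pass through .5 at $\theta$ = b; with c > 0 the lower asymptote rises to c, which is what makes the resulting curves asymmetrical.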

Conventional tests ordinarily provide good measurement for the middle
three-quarters, say, of the group tested. The tailored test cannot hope to
provide much improved measurement for these examinees, but it can provide better
measurement at higher and lower levels of $\theta$. In view of this picture,
we are not satisfied to use an overall group statistic to describe the
effectiveness of measurement; instead, we use an information function to
tell us the accuracy of measurement at each level of $\theta$.

We  compute  and  compare  information  curves  for  a  variety  of  procedures. 

Some of the conclusions tentatively reached for a certain specified, presumably
typical,* tailored test using a simple up-and-down branching rule are listed
below. This rule increases (decreases) item difficulty by an amount d
after each correct (incorrect) response.

* The conclusions given are for items with $a_i$ = .5.

1. If the test score is $b^{(n+1)}$ and the step size d is .50, increasing
the number of test items beyond n = 10 does not appreciably
increase the accuracy of measurement.

2. The number of items answered correctly is perfectly correlated with
the score $b^{(n+1)}$.

3. When the foregoing scores are used, decreasing the step size
increases the accuracy of measurement near some one value of $\theta$ and
decreases it elsewhere.

4. The average difficulty of all items administered is a score providing
better measurement of the examinee than either $b^{(n+1)}$ or the
number of right answers.

In  the  conclusions  that  follow,  the  score  used  is  always  average  difficulty. 

5. For n = 60, a step size of around .40 seems best; for n = 10,
a step size of around 1.0. Either shortening or lengthening the
step size decreases the accuracy of measurement throughout the
range of $\theta$ that interests us.

6.  When  items  can  be  answered  correctly  by  chance  success,  the  accuracy 
of  measurement  is  sharply  reduced.  Also,  the  information  curves 
become  asymmetrical. 

7. When there is chance success, the accuracy of measurement can be
considerably increased by certain asymmetric modifications of the
step size used in the simple up-and-down method. Two such
modifications are found to be almost equally good.

8.  When  there  is  chance  success,  the  first  item  administered  should  be 
easier  than  that  specified  by  a  common  rule  of  thumb. 

9.  These  improvements  produce  nearly  symmetrical  and  reasonably  flat 
information  curves. 

10. When these tailored testing procedures are compared with a "standard"
conventional peaked test, also with a conventional unpeaked test,
both scored by the number of right answers, we see that the best
tailored testing procedure is nowhere less than 86 percent efficient
compared to the peaked test. For high-level examinees, the peaked
test is only about 30 percent efficient compared to the tailored
testing. The tailored procedure gives more accurate measurement
than the unpeaked conventional test for all examinees, regardless
of level.
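The simple up-and-down rule and the three scores compared in the conclusions above can be illustrated with a small simulation. This is a sketch under assumptions: the logistic response function with constant 1.7, the choice to average over the n administered difficulties only, and the name `up_and_down` are illustrative choices, not the report's own computations. One point needs no simulation: each correct answer adds d and each wrong answer subtracts d, so with x right answers the final difficulty is $b^{(1)} + d(2x - n)$, a linear function of x. This is why the two scores are perfectly correlated.

```python
import math
import random

def up_and_down(theta, n, d, b1=0.0, a=0.5, rng=None):
    """Simulate one tailored test with a fixed step size d.

    Returns the final difficulty, the number of right answers, and the
    average difficulty of the n items administered (assumed definition).
    """
    rng = rng or random.Random(0)
    b = b1
    difficulties = []
    n_right = 0
    for _ in range(n):
        difficulties.append(b)
        # Assumed logistic response model; a = .5 as in the footnote above.
        p = 1.0 / (1.0 + math.exp(-1.7 * a * (theta - b)))
        if rng.random() < p:
            n_right += 1
            b += d          # harder item after a correct answer
        else:
            b -= d          # easier item after a wrong answer
    return {"final": b, "n_right": n_right, "avg": sum(difficulties) / n}

result = up_and_down(theta=0.5, n=10, d=0.5)
# The final difficulty is a linear function of the number right.
assert abs(result["final"] - 0.5 * (2 * result["n_right"] - 10)) < 1e-12
```

The average difficulty, by contrast, depends on the order of the right and wrong answers as well as on their number, so it can distinguish examinees whom the other two scores cannot.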


Before closing, let us note some of the limitations of tailored testing
procedures and of the theory given here.

1. Suppose a pool of test items can be grouped into subtests measuring
substantially different psychological dimensions. Without such
grouping into subtests, tailored testing based on such a pool cannot
produce accurate measurements with a clear meaning.

2. The theory given here assumes that items differ from each other
only on difficulty level. In practice, they differ also on $a_i$
(discriminating power) and on $c_i$ (see text). It is an open
theoretical question how tailored testing should be modified to
deal with this more general situation.

3. Accurate estimation of the item parameters necessary for tailored
testing is at present a difficult, expensive, and hazardous
operation.

4. If there is any doubt about the accuracy of the estimated values of
the item difficulties, $b_i$, there will be doubt about the accuracy
and fairness of the final scores given to the examinees.

5. If, say, 300 items are available for tailored testing, better
measurement will often be obtained by selecting, say, the
n = 60 most discriminating items (highest $a_i$) and administering
these as a conventional test, rather than by using all 300 in a
tailored testing procedure.

Until now, even some very primitive questions about how to carry out
tailored testing did not have even vague answers. Granted certain assumptions,
we now have tentative answers to some of these questions. More important,
we have a theoretical approach, drawing heavily on bioassay theory and
results, and on the theory of stochastic processes with particular reference
to Markov chains. This theory shows how we can go ahead to evaluate the
endless variety of different possible combinations of branching processes
and scoring procedures available for tailored testing. Perhaps in due
course, some direct way of finding truly optimum tailored testing
strategies will be found.

The  theory  in  this  report  is  based  on  certain  rather  technical  and 
specialized  assumptions.  Most  of  the  conclusions  reached  are,  hopefully, 
of  more  general  validity.  The  tailored  testing  procedures  themselves 
can  provide  accurate  measurements  without  any  need  for  many  of  these 
assumptions. 

We have investigated in detail only those tailored testing procedures
that  could  be  rigorously  evaluated  numerically  for  tests  of  60  items  (and 
longer).  Up-and-down  item-selection  methods  with  continually  shrinking  step 
size  (section  19)  should  be  able  to  produce  more  accurate  measurement  than 
is  obtained  by  methods  without  a  shrinking  step  size.  There  is  a  clear 
need  for  studies  to  evaluate  various  possible  shrinking-step-size  procedures 
for  tests  much  longer  than  n  =  10  or  12  .  These  studies  will  probably 
have  to  be  carried  out  by  Monte  Carlo  methods. 

Small sample properties of the maximum likelihood estimator of $\theta$
(Tsutakawa, 1967a, 1967b; Billingsley, 1961; Dixon, 1965; Dixon & Mood,
1948) should also be investigated. Another estimation method requiring
further study by Monte Carlo methods is the Spearman-Kärber method
(Spearman, 1908; Kärber, 1931). This method is described and favorably
evaluated for bioassay purposes by Tsutakawa (1967a).
This appendix outlines one or two results from the theory of Markov
chains, relevant for evaluating tailored testing procedures.

We are concerned (as in section 5) with the random variable $b^{(v)}$,
the difficulty of the v-th item administered to a given examinee. The
possible values of $b^{(v)}$ are (see eq. 15) $b^{(v)} = jd$, where d is some
prespecified step size and j is a (possibly negative) integer with
$|j| \le n$. There will be no loss of generality for the purposes of this
appendix if we rescale $b^{(v)}$ so as to set d = 1, in which case $b^{(v)}$
only takes on integer values between -n and +n, inclusive. Denote
such integer values by either j, k, or i.

Define the transition probability

$$p_{ij} = \mathrm{Prob}\,(b^{(v+1)} = j \mid b^{(v)} = i), \qquad v = 1, 2, \ldots . \tag{63}$$


By the Markov property, this does not depend on $b^{(1)}, \ldots, b^{(v-1)}$. We will
consider only stationary transition probabilities, which means that $p_{ij}$
does not vary with v (this rules out all branching methods with shrinking
step size).

Define the r-step transition probabilities

$$p_{ij}^{(r)} = \mathrm{Prob}\,(b^{(v+r)} = j \mid b^{(v)} = i), \qquad r, v = 1, 2, \ldots . \tag{64}$$


It is easily seen that

$$p_{ij}^{(2)} = \sum_k p_{ik}\, p_{kj} .$$

If the $p_{ij}$ are written as the elements of a square matrix $P = \|p_{ij}\|$
of order 2n + 1, then the $p_{ij}^{(2)}$ are elements of $P^2 = PP$. Similarly,
the $p_{ij}^{(r)}$ are elements of $P^r$:

$$\|p_{ij}^{(r)}\| = P^r . \tag{65}$$


For any given examinee, the nonzero elements of the matrix P are the
$P_i(\theta)$ and $Q_i(\theta)$ of (1), (2), (3), or (5), or simple functions of these,
depending on the branching process chosen.
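For the simple up-and-down rule, the matrix P and its powers can be built directly, as in the following sketch. The logistic response function with constant 1.7, the treatment of the two end states (an examinee at difficulty +n or -n stays there instead of leaving the grid, which cannot affect the first n steps when testing starts at 0), and all function names are assumptions made for illustration.

```python
import math

def p_correct(theta, b, a=0.5):
    """Assumed logistic probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-1.7 * a * (theta - b)))

def transition_matrix(theta, n, d=1.0):
    """Matrix of order 2n + 1; index i + n stands for difficulty i * d."""
    size = 2 * n + 1
    P = [[0.0] * size for _ in range(size)]
    for i in range(-n, n + 1):
        p = p_correct(theta, i * d)
        up, down = min(i + 1, n), max(i - 1, -n)   # assumed boundary handling
        P[i + n][up + n] += p            # correct answer: one step harder
        P[i + n][down + n] += 1.0 - p    # wrong answer: one step easier
    return P

def matmul(A, B):
    size = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def matpow(P, r):
    """r-step transition probabilities as the elements of P^r."""
    size = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(r):
        result = matmul(result, P)
    return result
```

Each row of P sums to 1, and an element such as `matpow(P, 2)[n][n + 2]` is the probability of two successive correct answers starting from difficulty 0.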

Let $p_{0i}$ denote the probability that $b^{(1)} = i$ and let p be the
vector $\{p_{0i}\}$. If we choose our origin so that $b^{(1)} = 0$, then $p_{0i}$ is
zero, except that when i = 0, then $p_{0i}$ is 1.

The final frequency distribution of $b^{(n+1)}$ for a given examinee is
thus the vector $P'^{\,n} p$:

$$\{\mathrm{Prob}\,(b^{(n+1)} = i \mid \theta)\} = P'^{\,n} p . \tag{66}$$


The mean and variance of this distribution are important quantities related in
a simple way to those computed recursively by (23) and (30). These quantities
could be computed directly from $p'P^n$. The matrix $P^n$ may be computed from
the latent roots and vectors of PP' and of P'P (Feller, 1959, chapt. 16,
eq. 1.12).
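The distribution of the final difficulty, and its mean and variance, can be obtained numerically by applying the transposed transition matrix n times to the starting vector. The sketch below does this one step at a time for the simple up-and-down rule without forming the matrix explicitly. The logistic response function with constant 1.7 and the function names are assumptions for illustration; it is not the recursive computation of (23) and (30).

```python
import math

def p_correct(theta, b, a=0.5):
    """Assumed logistic probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-1.7 * a * (theta - b)))

def final_distribution(theta, n, d=1.0, a=0.5):
    """Probabilities of each possible final difficulty; index i + n stands for i * d."""
    size = 2 * n + 1
    dist = [0.0] * size
    dist[n] = 1.0                        # first item has difficulty 0
    for _ in range(n):                   # one application of P' per item
        new = [0.0] * size
        for idx, mass in enumerate(dist):
            if mass == 0.0:
                continue
            p = p_correct(theta, (idx - n) * d, a)
            new[min(idx + 1, size - 1)] += mass * p          # correct answer
            new[max(idx - 1, 0)] += mass * (1.0 - p)         # wrong answer
        dist = new
    return dist

def mean_and_variance(dist, n, d=1.0):
    values = [(idx - n) * d for idx in range(len(dist))]
    mean = sum(v * q for v, q in zip(values, dist))
    var = sum((v - mean) ** 2 * q for v, q in zip(values, dist))
    return mean, var
```

Calling `mean_and_variance(final_distribution(theta, n, d), n, d)` over a grid of $\theta$ values gives the regression of the score on $\theta$ and its conditional variance, the ingredients from which the accuracy of measurement at each level can be judged.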

As already noted in section 19, asymptotic results are of marginal interest
for present purposes. Asymptotic properties of $b^{(n+1)}$ can be found from
Chung (1960, part 1, section 12). Although the $b^{(v)}$ form a Markov chain,
the average difficulty scores do not. The average difficulty score is a
functional of the Markov chain. Asymptotic properties of such functionals
are treated by Chung (part 1, sections 14-16). An asymptotic formula for the
error variance of the average difficulty score is given by Tsutakawa (1967a, eq. 5).